Version 0.4-draft-20240411
This document is organized as successive expositions at increasing levels of detail, to give the reader an idea of the motivations and high-level differences from conventional processor architectures, eventually getting down to the detailed definitions that direct SecureRISC processor execution. So, if the introductory material seems a little vague, that is because it attempts to sketch an overall context into which the details are later fit.
SecureRISC was created to develop old ideas and notes of mine. It is not a complete Instruction Set Architecture (ISA), but only the things I have had time to consider and work on. It is certainly not a specification. At present, this document only exists for discussion purposes.
The ISA is mostly just ideas at this point. The opcode assignments and instruction specifications are little more than hints. The Virtual Memory architecture needs work.
SecureRISC began as an exploration of what a security-conscious ISA might look like. I hope I can improve it over time to live up to its name. Should it turn into something more than an exploration, I would intend to make it an Open Source ISA, along the lines of RISC‑V.
There is no software (simulator, compiler, operating system, etc.) for SecureRISC. This is a paper-only spec at this point. A compiler, simulator, and FPGA implementation might be created at some point, but that is probably years in the future.
SecureRISC is an attempt to define a security-conscious Instruction Set Architecture (ISA) appropriate for server class systems, but which with modern process technology (e.g. 5 nm), could even be used for IoT computing given the die area for a single such processor is a small fraction of one mm2. I start with the assumption that the processor hardware should enforce bounds checking and that the virtual memory system should use older, more sophisticated security measures, such as those found in Multics, including rings, segmentation, and both discretionary and mandatory (non-discretionary) access control. I also propose a new block-structured instruction set that allows for better Control Flow Integrity (CFI) and performance. For performance, several features support highly parallel execution and latency tolerance, even in implementations that avoid highly speculative execution, which can lead to security vulnerabilities.
A comment about Multics is appropriate here. There seems to be an impression among many in the computer architecture world that many Multics features are complex. They are simple and general, easy to implement, and remove pressure on the architecture to add warts for specialized purposes. Computer architecture from the 1980s to the present is often an oversimplification of Multics. For example, segmentation in Multics served primarily to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) puts pressure to keep the number of bits devoted to access control minimal when security considerations might suggest a more robust set. As another example, many contemporary processor architectures (e.g. RISC‑V) have two rings (User and Supervisor), with a single bit in PTEs (the U bit in RISC‑V) serving as a ring bracket. Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. Indeed, rings were not well utilized on Multics, but we now have more uses for multiple rings, including concurrent garbage collection, Just-In-Time (JIT) Compilation, and sandboxing.
The goals for SecureRISC in order of priority are:
wide issue)
Non-goals for SecureRISC include (this list will probably grow):
Security can mean many things. One of the most important is preventing unassisted infiltration (e.g. through exploiting buffer overflows, use-after-free errors, and other programming mistakes). Bounds checking is the primary defense against buffer overflows in SecureRISC. Another is preventing unintentionally assisted infiltration (e.g. phishing attacks installing trojans), which may be accomplished through mandatory access control. SecureRISC is not a comprehensive attempt at security but addresses the aspects that I think can be improved.
While I expect that mandatory (aka non-discretionary) access control is critical to computer security, at this point there is relatively little in SecureRISC’s architecture that enforces this (it is primarily left to software). However, I am still looking for opportunities in this area.
Security, garbage collection, and dynamic typing may appear to be orthogonal, but they are synergistic. SecureRISC attempts to minimize the impact of programming mistakes in several ways, such as making bounds checking somewhat automatic and making compiler-generated checking more efficient for disciplined languages where bounds checking is possible, and to keep pointers a single word, the architecture supports encoding the size in extra information per memory location. For undisciplined languages (e.g. C++) the compiler does not in general know the bounds that would be required to perform a check, and the two best methods so far invented to solve this also require some sort of extra information per memory location, such as the pointer and memory coloring used in AddressSanitizer or the tag bit in CHERI. AddressSanitizer uses an extra N bits per the minimum allocation unit (where that unit may be increased to reduce overhead) to detect errors with approximate probability 2−N. To address memory allocation error detection other techniques are necessary. One possibility is garbage collection (GC), which eliminates these errors, but to be a substitute for explicit memory deallocation, GC needs to be efficient, hence the goal synergy. Some implementations of GC are made more efficient by being able to distinguish pointers from integers that look like addresses at runtime, and some sort of tagging helps things. For languages requiring explicit deallocation of memory, AddressSanitizer may be used. However, AddressSanitizer on most architectures is too inefficient to use in production and is typically employed only during development as a debugging tool. SecureRISC seeks to make it efficient enough to use in production. CHERI accomplishes its extra bounds checking by implementing a 129‑bit capability encoding a 64‑bit pointer, 64‑bit base, 64‑bit length, type, and permissions (note the extra bit over each 64‑bit memory location required for making capabilities non-forgeable). Thus bounds checking, GC, or memory allocation error detection are all made possible or more efficient by having extra information per memory location. Since SecureRISC must support 64‑bit integer and float-point arithmetic, this extra information needs to be in addition to the 64 bits required for that data.
As justified above, SecureRISC targets its goals by what will likely be the most controversial aspect of SecureRISC: tags on words in memory and registers. The Basic Block descriptors may be more unusual, but the reader may come to appreciate this aspect of SecureRISC with familiarity (especially given the Control Flow Integrity advantages as a security feature), but the reader may in the end not find memory tags convincing because of the non-standard word size that results. An efficient and secure alternative is not known, and as a result, SecureRISC adds tags to memory locations. Tags simultaneously provide an efficient mechanism for bounding pointers, support use-after-free detection, support bounds checking with single-word pointers for undisciplined languages such as C++ (HWASAN or CHERI), and support more efficient Garbage Collection (the best solution to allocation errors), and also happen to support dynamically typed languages.
SecureRISC has not yet explored another use for tagging data, which is taint checking.
Before the reader dismisses SecureRISC because of tagged memory, consider the main memory options that SecureRISC processors are likely to support. Most contemporary processors use a cache block size of 576 bits (512 data bits plus 8 bits of ECC for every 64 bits of data), and provide efficient block transfers of this size between main memory and the processor caches by using interconnect of 72, 144, 288, or 576 bits. The equivalent natural width for SecureRISC is 640 bits (512 data bits, 64 tag bits, and 8 bits of ECC for every 72 bits of data and tag). However, there are multiple ways to provide the additional tag bits for SecureRISC, including the use of a conventional 576‑bit main memory. A simple possibility is to set aside ⅛ of main memory for tag storage. Misses from the Last Level Cache (LLC) would then do two main memory accesses, one reading 576 and then another access reading 72 bits (a total of 648 bits—the additional 8 bits the result of not sharing ECC over tags and data).* (There might be a specialized write-thru cache for the ⅛ of main memory after the LLC reserved for tag block read to exploit locality, but the coherence of this would need to be figured out.) Support for encryption of data in memory is a goal of SecureRISC, and good encryption requires the storage of authentication bits, increasing the size of cache blocks stored in main memory. The encryption proposed for SecureRISC encrypts 512 bits of data, 64 bits of tag into 704 bits of encrypted authenticated ciphertext, and then appends 64 bits of ECC (8 bits per 88) giving a 768‑bit memory storage, which conveniently fits three non-ECC DIMM widths. Alternatively, in a system with 512‑bit main memory, 1.5 main memory blocks could be used for a SecureRISC cache block (e.g. three transfers of 256 bits or six of 128 bits or twelve of 64 bits). Thus the cost for encrypted and tagged memory is the difference between two ECC DIMMs and three non-ECC DIMMs.
* If the system interconnect fabric is wide enough to support it (AMBA Coherent Hub Interface (CHI) may have support for this?), it may be preferable to move the read of the ≈⅛ of main reserved for tags into the memory controller, and then supply cache blocks with tags throughout the rest of the system.
The above is summarized in the follow table:
Data | Tags | Enc | ECC | Total | Organization | Type | Use |
---|---|---|---|---|---|---|---|
Cached, Tagged | |||||||
512 | 64 | 128 | 64 | 768 |
96×8, 192×4, …, 768×1 or 64×12, 128×6, 256×3 |
Main | All |
512 | 64 | 64 | 640 | 80×8, 160×4, …, 640×1 | Main | All unencrypted | |
512 | 64 | 72 | 648 |
72×8, 144×4, …, 576×1 + 72×1 |
Main |
All unencrypted ≈⅛ of main reserved for tags[1][3] |
|
512 | 64 | 128 | 88 | 792 |
72×8, 144×4, …, 576×1 + 72×3 |
Main | All (≈⅓ of main reserved for tags + encryption)[2][4] |
Cached, Untagged | |||||||
512 | 64 | 576 | 72×8, 144×4, …, 576×1 | I/O | Data only (no pointers or code) | ||
512 | 128 | 64 | 704 | ? | ? | Encrypted data only | |
Uncached | |||||||
n.a. | 8, 16, 32, 64, 128 | I/O | Registers |
Footnotes:
It may be possible to add tags selectively to portions of memory. For
example, slab allocators are typically page based. Thus one would
direct the processor to read tags just from the beginning or end of the
page. For example, the tag for vaddr
might be read
from vaddr63..12 ∥ 03 ∥ vaddr11..3
and the slab allocator made aware to start is allocation at offset 512
in pages, so tags are stored at offsets 32..511 (0..31 not being used as
tags on tags are not required—these offset are available for
allocator overhead). A Page Table Entry (PTE) bit might indicate this
form of tag storage is in use. Separate mechanisms for bage
tags, stack tags, and slab allocations larger than a page would still be
required.
The above discussion suggests at least five different uses of memory tags:
While memory tagging is useful for the above, it is used in different ways for the above. Instead of a single unified mechanism, SecureRISC uses memory tagging in two ways, one for AddressSanitizer, and then combining CHERI and disciplined language support into the other.
Is SecureRISC Reduced Instruction Set Computing
? It is certainly
not a small instruction set, but RISC no longer stands for that, but has
been primarily a marketing term. As one wag put it, RISC is any
instruction set architecture defined after 1980. A more accurate
description might be ISAs suitable as advanced compiler targets, as the
general trend is to depend on the compiler to exploit features of the
ISA, such as redundancy elimination, sophisticated register allocation,
instruction scheduling, etc. Such things have generally favored ISAs
organized along the load and store model and simple addressing modes.
By this criterion, I believe SecureRISC is a RISC architecture, but it is
not a simplistic or reduced instruction set. Contemporary processors,
even for simple instruction sets, are very complex, and that complexity
will probably grow until Moore’s Law fails. The design challenges
are large. In the contemporary world, simplicity is a goal when it
furthers other goals such as performance (e.g. by maximizing clock
rate), efficiency (e.g. by reducing power consumption), and so on.
The original motivation for block-structured ISAs was Instruction-Level
Parallelism (ILP) studies that I did back in 1997 at SGI that showed
that instruction fetch was the limiting factor in ILP. This was before
modern branch prediction,
e.g. TAGE,
so that result may no longer be true. The idea was that instruction
fetch is analogous to linked list processing, with parsing at each list
node to find the next link. Linked list processing is inherently slow
in modern processors, and with parsing it is even worse. I wanted to
replace linked lists with vectors (i.e. to vectorize instruction fetch),
but couldn’t figure out how, and settled for reducing the parsing
at each list node. I still feel that this is worthwhile, but the exact
tradeoffs might require updating older work in this area. The best
validation of this dates from 2007,
when Professor Christoforos Kozyrakis
convinced his Ph.D. student
Dr. Ahmad Zmily
to look at this approach in a Ph.D. thesis. In the introduction of
Block-Aware Instruction Set Architecture,
Dr. Zmily wrote,
We demonstrate that the new architecture improves upon conventional
superscalar designs by 20% in performance and 16% in energy.
Such an advantage is not enough on which to foist a new ISA upon the
world, but it encourages me to think that it does provide an impetus
for using such a base when creating a new ISA for other purposes, such
as security. Since 2007, improvements in the proposed block-structured
ISA should result in greater performance improvements, while
improvements in branch prediction (e.g. TAGE predictors) decrease some
of the advantages. Also, Dr. Zmily’s work was based on the MIPS
ISA, and SecureRISC is quite different in many aspects. Should
SecureRISC be developed to the point where binaries can be produced and
simulated, a more appropriate performance estimate will be possible.
Before starting SecureRISC, my previous experience was with the many ISAs and operating systems. Long after starting my block-structured ISA thoughts, I became involved in the RISC‑V ISA project. RISC‑V is in many ways a cleaned-up version of the MIPS ISA (e.g. minus load and branch delay slots) and it seems likely to become the next major ISA after x86 and ARM. Being Open Source, RISC‑V has easy-to-access documentation. As such I have used it for comparisons in the current description of SecureRISC and modified some of its virtual memory model to be slightly more RISC‑V compatible (e.g. choosing the number of segment bits to be compatible with RISC‑V Sv48). However, most aspects of the SecureRISC ISAs predate my knowledge of RISC‑V and were not influenced by it, except that I found that RISC‑V’s Vector ISA was more developed than my thoughts (which were most influenced by the Cray-1, which supported only 64‑bit precision).
In 2022 I encountered the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) research effort. I found their work impressive, but I had concerns about the practicality of some aspects. Despite my concerns, I thought that SecureRISC might be a good platform for CHERI, so I have extended SecureRISC to outline how it might support CHERI capabilities as an exploration. I also modified SecureRISC’s sized pointers to include a simple exponent to extend the size range based on ideas from CHERI but kept them single words by not including both upper and lower bounds. This sized pointer is not as capable as a CHERI pointer, but it is 64 bits rather than 128 bits, which has the advantage of size. There is a more detailed discussion of CHERI and SecureRISC below.
In 2023 I took the virtual memory ideas in SecureRISC and created a proposal for RISC‑V tentatively called Ssv64. I made Ssv64 much more RISC‑V compatible than SecureRISC had been (e.g. in PTE formats), and have recently been backporting some of those changes into SecureRISC since there is no reason to be needlessly different.
SecureRISC does depend upon a few new microarchitecture structures to realize its potential. There should be a Basic Block Descriptor Cache (BBDC), though this could be thought of as an ISA-defined Branch Target Buffer (BTB). The BBDC is in addition to the usual L1 Instruction Cache. While the BTB and BBDC are similar, the BBDC is likely to be sized such that it requires more than one cycle to access (resulting in a target prediction in two cycles), making another structure useful (in the An Example Microarchitecture section at the end this is called a Next Descriptor Index Predictor) to enable a new basic block to be fetched every cycle by providing just the set index bits one cycle early. The most novel new microarchitecture structure suggested for SecureRISC is the Segment Size Cache, which caches the segment size log2 for a segment number, which is used for segment bounds checking on the base register of loads. This cache might also provide the GC generation number of the segment (TBD). While these are new structures, in the context of a modern microarchitecture, especially one with two or three levels of caches and a vector unit, they are tiny and worthwhile.
Some things remain unchanged from other RISCs. Addresses are byte-addressed. Like other RISC ISAs, SecureRISC is mostly based upon loads and stores for memory access. Integers and floating-point values have 8, 16, 32, or 64‑bit precision. Floating-point would be IEEE-754-2019 compatible. The Vector ISA will probably be similar to the RISC‑V Vector ISA but might however use the 48‑bit or 64‑bit instruction format to do more in the instruction word and less with vset (perhaps a subset of vector instructions would exist as the 32‑bit instructions). Also, there are multiple explicit vector mask registers, rather than using v0. (There are sixteen vector mask registers, but only m0 to m3 are usable to mask vector operations in 32‑bit vector instructions—the others primarily exist for vector compare results.)
Readers will have to decide for themselves whether the proposed virtual memory is conventional because it is somewhat similar to Multics, or unconventional because it is different from RISC ISAs of the last forty years. A similar comment could be made concerning the register architecture since it echoes the Cray-1 ISA from 1976 but is somewhat different from RISCs since the 1980s. (The additional register files in SecureRISC serve multiple purposes, but an important one is supporting execution of many instructions per cycle without the wiring bottleneck that a single unified register file would create.)
The reader of this document is unlikely to need the following glossary, but just in case, here are the terms and abbreviations used herein. Most readers will want to skip but come back here if some acronym or concept is unfamiliar.
performed to gain access to a set of shared locations, and Release accesses (typically stores) are performed to grant access to sets of locations. The paper went on to introduce release consistency.
Bfor bytes or
bfor bits, as in 4 KiB for 4,096 bytes, or 2 MiB for 2,097,152 bytes.
Prefix | Value | ||
---|---|---|---|
Ki | 1024 | 210 | 1,024 |
Mi | 10242 | 220 | 1,048,576 |
Gi | 10243 | 230 | 1,073,741,824 |
Ti | 10244 | 240 | 1,099,511,627,776 |
Pi | 10245 | 250 | 1,125,899,906,842,624 |
Ei | 10246 | 260 | 1,152,921,504,606,846,976 |
Zi | 10247 | 270 | 1,180,591,620,717,411,303,424 |
Yi | 10248 | 280 | 1,208,925,819,614,629,174,706,176 |
hitif the line containing the access is stored in the cache, and a
missif it is not; cache misses result in the a line-sized read from the next level of the cache hierarchy. A cache miss may also require
evictionof some other cache line to make room to store the incoming data. Caches handle writes in different ways: a write-through cache writes store data to the cache and also sends it to the next level of the hierarchy; write-back caches store data in the cache and mark the cache line as
dirty, meaning that the cache line will have to eventually be written back to higher levels of the cache hierarchy (e.g. on eviction). Caches may be fully associative (a block of data may be located in any cache location), or N-way set-associative (the set of N locations for a given block of data is determined by a few address bits—the set index) and only the N ways need to be searched for a match. Cache blocks have an associated tag, which is typically the address bits not used in the set index, though in some cases tags and indexes may be hashed.
evictionto make room for the new data. The Cache replacement policy is the algorithm that determines which location in the cache (e.g. which way of an N-way set associative cache) is evicted and used to store the incoming data. The optimal policy is to replace the block that will be used furthest in the future, which is generally not known, and so other simpler algorithms are typically used, such as Least Recently Used (LRU) Pseudo-LRU, and Re-Reference Interval Prediction (RRIP).
Stop The WorldGC).
extents. Large allocations are allocated directly as extents. The name derives from
Jason Evans’ malloc.
A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.Such algorithms are important motivators for the atomic operations, such as Compare-and-Swap.
count—Intel’s
Instruction Pointeris better. In SecureRISC it is the address for Basic Block Descriptor fetch, which makes it even less descriptive, but out of tradition, SecureRISC still keeps this name.
physical addressis a potentially confusing term, being dependent upon context.
stubin a an area called the PLT, which contains code that can invoke the dynamic linker on the first call to resolve the target. After the resolution, the PLT stub may be patched (depending on the ISA) so that future calls skip the dynamic linker. If the PLT stubs are patched, then the PLT must be writeable, but since such PLT entries are grouped together, this results in relatively few pages being unshared.
population count, which is another name for the formal Hamming weight. While Hamming weight is defined more generally, popcount as used in this document is simply the number of 1 bits in bitstring or word.
logical registers) to locations in a larger execution register file to eliminate needless restrictions on instruction execution order that would be required if logical register numbers were used directly.
writes issuing from any processor may not be observed in any order other than that in which they were issued.
Really Invented by Seymour Craydue to Cray’s contribution to ISA design on the CDC 6600 and the Cray-1. John Mashey’s once opined that the definition of RISC had become any instruction set defined after 1980, given the number of things that have been called RISC thereafter that were more complicated than the ISAs that first embraced the RISC terminology (SPARC and MIPS). See Is SecureRISC actually RISC? for my thoughts on a useful definition of RISC.
slab allocatorcan refer to a generic class of memory allocation algorithms, or a specific Linux kernel allocator (currently being deprecated in favor of the SLUB allocator). SecureRISC uses the generic class meaning, which is where allocation requests are binned into size groups, and for small sizes, and for a given size group there is a set of pages containing objects of that size. Allocation is just removing from the free list, and deallocation is simply adding to the free list. Large sizes are handled with page allocation. An example of a slab allocator is jemalloc. Slab alloctors interact with memory safety is multiple ways, and provide a good opportunity to exploit tagging to detect buffer overflows and dangling pointer errors.
Orange Book. While it has been superseded by the Common Criteria for Information Technology Security Evaluation, there are still relevant concepts from the Orange Book, such as Mandatory access control, which shows up in many products, such as Security-Enhanced Linux (SELinux).
auto
keyword or simply omitting
the static
keyword). These allocations are to either
the stack or registers and always thread-local because stacks and
registers are thread-local.
new
and delete
). These
allocations are done on one or more areas or heaps, and are always
shared by all threads because the heap is shared by all threads.
static
keyword). These allocations are to ELF
sections such as .bss
and .data
. A
single copy is shared by all threads.
__thread
keyword). These allocations are
to ELF sections such as .tbss
and .tdata
. Each thread gets exactly one copy of
this data. This allocation may occur when a thread is created
(initialized from the template created by the compiler) or
dynamically on the first use. This storage is called Thread-Local
Storage (TLS). See
also GCC Thread-Local Storage
and Ulrich Drepper’s ELF Handling For Thread-Local Storage.
lookasideinvolved).
rings). As examples of privilege nesting, user mode would typically not have access to supervisor memory, but supervisor mode would have access to user memory, and user mode would not have access to certain instructions and registers that supervisor mode can access.
natural unit of datafor a processor, but often contemporary ISAs define word in a historical way, e.g. to be 32 bits even when their datapaths make 64 bits more natural. They do so because their predecessor ISAs had 32‑bit words. Thus, word is term that is simply defined by each new ISA. The SecureRISC definition is given below.
Other terminology and acronyms are associated with cryptography and are summarized below. The reader should return here if encountering unfamiliar cryptographic terminology.
block). The key length need not be the same as the data length. The input to encryption is called plaintext, and the output is called ciphertext. Decryption with the same key takes ciphertext and produces the original plaintext. This is represented as follows for encryption algorithm E, decryption algorithm D, plaintext P, ciphertext C, key K:
lightweight(e.g. compared to AES) block cipher optimized for hardware. The Simonm/n family employs a Feistel structure to encrypt m‑bit blocks with a n‑bit key.
cryptographic protection of data stored in constant length blocks. The encryption of the 𝑗th 128 bits of a block with tweak 𝑖 is as follows where the block length is a multiple of 128 bits (i.e. the following does not cover ciphertext stealing):
⊕ | Bitwise xor |
⊗ | Multiplication of two polynomials over the binary field GF(2) mod 𝑥128+𝑥7+𝑥2+𝑥+1 |
Key |
Is a two-part encryption key, consisting of Key1
and Key2 where Key = Key1∥Key2. For AES-128, Key would be 256 bits, and for AES-256 it would be 512 bits. |
𝑖 | is the value of 128-bit tweak |
P𝑗 | is the 𝑗th block of 128 bits (the plaintext) |
C𝑗 | is the 𝑗th block 128 bits of ciphertext for P𝑗 |
Other acronyms may be less familiar as they come
from RISC‑V.
To quote from RISC‑V
International’s About RISC‑V:
RISC-V is an open standard Instruction Set Architecture (ISA) enabling a new era of processor innovation through open collaboration.
The official RISC‑V ISA specifications may be downloaded
from RISC‑V specifications
while working versions may be found at
the GitHub RISC‑V ISA Manual repository.
The two primary specifications are:
Memory type | Vacant, Main memory, I/O |
Read access width | Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit reads supported. |
Write access width | Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit writes supported. |
Execute access width | Subset of 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit instruction fetch supported. |
Atomic Memory Swap (AMOSwap) width |
Subset of 8‑bit, 16‑bit, 32‑bit,
64‑bit, 128‑bit
AMOSwaps supported. |
Atomic Memory Logical (AMOLogical) width |
Subset of 8‑bit, 16‑bit, 32‑bit,
64‑bit, 128‑bit
AMOLogicals supported. |
Atomic Memory Arithmetic (AMOArithmetic) width |
Subset of 8‑bit, 16‑bit, 32‑bit,
64‑bit, 128‑bit
AMOArithmetics supported. |
Page Table Reads | Supported or not. |
Page Table Writes | Supported or not. |
LR/SC support level | None, NonEventual, Eventual. |
Coherence | Not coherent or coherence channel number |
Cacheability | Yes or No |
Idempotency | Whether reads and writes have side effects. |
Worlds) assigned to system interconnect ports (e.g. processors and devices) that are checked in a distributed fashion by resources (such as memories and peripheral devices) or checkers located before such resources. Worlds are created and configured by a Trusted Execution Environment (TEE), usually at system boot time. A two-world simplification of WorldGuard could be used to provide similar functionality to ARM’s TrustZone.
Write Any Values, Reads Legal Values, which is a specification of a register field that allows processors to implement a subset of the functionality described in a CSR (e.g. by hardwiring certain bits to fixed values, or by allowing some values to be written but not others). If an unsupported value is written, the implementation substitutes some legal value.
Writes Preserve values, Reads Ignore Values. The RISC‑V privileged specification states,
Software should ignore the values read from these fields, and should preserve the values held in these fields when writing values to other fields of the same register. For forward compatibility, implementations that do not furnish these fields must make them read-only zero.
There are so many security vulnerabilities that instruction set and microarchitecture design should be aware of that these are now separated from the conventional glossary above.
Gather Data Sampling (GDS)by Intel.
targeting Intel SGX technology which defeats enclave memory isolation, sealing, and attestation guarantees.Intel calls Foreshadow a
L1 Terminal Fault (L1TF)vulnerability. Intel’s analysis
identified two closely related variants of Foreshadow, which we collectively call Foreshadow-NG(quotes from Foreshadow-NG). These attacks allow the entire L1 data cache to be dumped, which potentially exposes data from other address spaces that otherwise not be nameable for leaking by other techniques:
Foreshadow-NG-type attacks variants exploit a subtle L1TF microarchitectural condition that allows to transiently compute on unauthorized physical memory locations that are currently not mapped in the attacker’s virtual address space view. As such, Foreshadow-NG is the first transient execution attack that fully escapes the virtual memory sandbox-traditional page table isolation is no longer sufficient to prevent unauthorized memory access.
Our key finding is that all the common synchronization primitives can be microarchitecturally bypassed on speculative paths, turning all architecturally race-free critical regions into Speculative Race Conditions (SRCs).
transient execution attack that leaks arbitrary data on all AMD Zen CPUs.
is a transient-execution attack which observes the values of memory loads and stores on the current CPU core. ZombieLoad exploits that the fill buffer is used by all logical CPUs of a CPU core and that it does not distinguish between processes or privileges.
Much more in SecureRISC is unconventional. To prepare the reader to put aside certain expectations, we list some of these things here at a high level, with details in later sections.
a[i]
loads or stores to that location
only after checking that i
is within the bounds
specified in the array pointer. C++ *p++
sort of
programming is less amenable to SecureRISC bounds checking and
must use either CHERI-128 pointers with bottom and top encoded in
addition to the pointer itself or use the alternative
AddressSanitizer memory tag method of bounds checking.
for i ← a to b
(where the loop iteration count
is b − a + 1
)
and for i ← a to b step -1
(where the loop iteration count
is a − b + 1
). The loop
may be exited early with a conditional branch; only the loop back
is predicted with the hint.
63 | 48 | 47 | 0 | ||||||||
segment | fill | tableindex0 | offset | ||||||||
16 | 48−SS | PTS | SS−PTS |
The Basic Block (BB) descriptor aspect listed above is perhaps the most unfamiliar. Below are some of the rationale and advantages of this approach.
Contemporary processors have various structures that are created and updated during program execution to improve performance, such as Caches, TLBs, Branch Target predictors (BTB), Return Address predictors (RAS), Conditional Branch predictors, Indirect Jump and Call predictors, prefetchers, and so on. In SecureRISC one of these is moved into the ISA for performance and security. In particular, the BTB becomes a Basic Block Descriptor Cache (BBDC). The BBDC caches lines of Basic Block Descriptors that are generated by the compiler, in a separate section from the instructions. SecureRISC also seeks to make the Return Address predictor more cache-like and build in some additional ISA support for loop prediction.
fall throughto subsequent descriptors, but each has a pointer to the instructions to fetch, and so the instruction blocks of a bage could simply be sorted by frequency, placing the hottest first and the coldest last, or some similar arrangement*, all without introducing new instructions or changing anything other the pointers in the descriptors.
I started with the assumption that pointers are a single word, which are
expanded based on the 8‑bit tag to a base and size when loaded
into the doubleword (144‑bit) Address Registers
(ARs). This enables automatic bounds
checking. The effective address calculation uses
the ARs base to check the offset/index
value against the size. This supports programs
oriented toward a[i]
pointer usage, but not
C++ *p++
pointer arithmetic (such arithmetic is possible in
SecureRISC at the expense of bounds checking).
In contrast, the University of Cambridge
Capability Hardware Enhanced RISC Instructions (CHERI)
Project started with the assumption that capability pointers are four
words (including lower and upper bounds, the pointer itself, and
permissions and object type), and invented a compression technique to
get them down to two words. SecureRISC can support CHERI by using its
128‑bit AR
load and store instructions to transfer capabilities to and from the
144‑bit ARs, and therefore be
able to accommodate either singleword or doubleword pointers. Support
for the CHERI bottom and top decoding, its permissions, and its
additional instructions would be required. The CHERI tag bit is
replaced with two SecureRISC reserved tag values (one tag value in word
0, another in word 1). I would expect languages such as Julia and Lisp
would prefer singleword pointers, so supporting both singleword and
doubleword pointers allows both to exist on the same processor
depending on the instructions generated by the compiler.
Unlike CHERI, SecureRISC pointers have only a size and not bottom and top values encoded. As a result, SecureRISC’s bounds checking is more suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so bounds checking is better suited to disciplined languages, primarily ones that emphasize array indexing over pointer arithmetic. My expectation is that running some C++ code on would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. Bounds checking is a better target for Rust, Swift, Julia, or Lisp. SecureRISC can use unsized pointers for C++, but using these would represent a less secure mode of operation. The supervisor would need to enable on a per process basis whether such C++ pointers can be used; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity. Instead, undisciplined languages (such as C++) are likely to either use CHERI-128 pointers or memory and pointer cliques for security.
Tagged memory words are separable from other aspects of SecureRISC, such as the Multics aspects and the Basic Block descriptor aspects. One could imagine a version of SecureRISC without the tags and a 64‑bit word (72 bits with ECC in memory). Even in such a reduced ISA—call it SemiSecureRISC—I would keep the AR/XR/SR/VR model. SemiSecureRISC is still interesting for its performance and security advantages, but I do not plan to explore it. There is also the possibility of combining SemiSecureRISC with CHERI and its 1‑bit tag, since the CHERI project has done a lot of important software work. Call such an ISA BlockCHERI. I suspect the CHERI researchers would say that the only advantage in BlockCHERI would be the performance advantage of the Block ISA and the AR/XR/SR separation, with the ARs specialized for CHERI capabilities, and the XRs/SRs for non-capability data. My primary thought on the BlockCHERI is that the difference between a 65‑bit memory (73 bits with ECC) and 72‑bit memory (80 bits with ECC) may find that 7 extra bits may be put to good use.
One could imagine variants of SecureRISC that have only some of its features:
Name | Block ISA | Segmentation | Rings | Tags | CHERI | Word | Pointer |
---|---|---|---|---|---|---|---|
SecureRISC | ✔ | ✔ | ✔ | ✔ | ✔ | 72 | 72/144 |
SemiSecureRISC | ✔ | ✔ | ✔ | 64 | 64 | ||
BlockRISC | ✔ | 64 | 64 | ||||
BlockCHERI | ✔ | ? | ? | ✔ | 65 | 130 |
As I indicated earlier, I don’t think that BlockRISC is sufficient to justify a new ISA. I am concentrating on the full package.
I need to think more carefully about I/O in a SecureRISC system. Some I/O will be done in terms of byte streams transferred via DMA to/and from main memory (e.g. DRAM). Such I/O if directed to tagged main memory writes the bytes with an integer tag. Similarly, if processors use uncached writes of 8, 16, 32, or 64 bits (as opposed to 8‑word blocks) to tagged memory, the memory tag must be changed to integer. Tag-aware I/O of 8‑word units exists and may be for paging and so forth. It may be that a general facility for reading tagged memory including the tag as a stream of 8‑bit bytes could be provided along with a cryptographic signing, and for writing such a stream back with signature checking will be useful.
Ports onto the system interconnect fabric will have to have rights and permissions assigned by the hypervisor, and perhaps hypervisor guests. This needs to be worked out.
Being able to support DMA from lower privilege rings (user-mode
)
would be desirable, but it seems difficult to make this work, because
then the user ring code would be sending its own local virtual addresses
to the I/O device for DMA, and so the I/O devices would have to be able
to translate user addresses to system interconnect addresses via
two-level page tables and user-mode would have to tell the I/O device
the page table the supervisor assigned it, which it doesn’t know.
I am for now leaving user-mode I/O unaddressed. One possibility is to
implement a 80 to 96‑bit global address space by converting the
existing 12‑bit per-processor translation cache flush optimization
(ASID) to a system-wide 16 to 32‑bit Address Space ID and start
the segment descriptor lookup from this extended virtual address. This
would allow I/O to locate user-mode page tables. The cost is wider
address matching in translation caches and potentially multi-level walks
to read segment descriptors (probably with additional caching).
Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). While not a documentation convention, I might as well mention up front that SecureRISC is similarly Little Endian in its byte addressing.
Links to Wikipedia articles are followed with a icon. Links to documents in PDF format are followed by a .
To augment English descriptions of things, SecureRISC uses notation that operates on bits. This notation is sketched out here, but it is still only a guide to the reader (i.e. it is not a complete formal specification language such as SAIL). Its advantage is brevity.
For those familiar with Multics, the primary thing to know is that SecureRISC has up to 8 rings (0..7) and inverts ring numbers so that ring 0 is the least privileged. Also, segment sizes are powers of two.
domainswhere permissions were specified without nesting. This is straight-forward, until the procedure for evaluating permissions of reference parameters using the privilege of the calling domain is attempted. SecureRISC does not attempt to generalize rings to domains due to this complexity.
One additional change to Multics rings may be to require access to segments at lower privilege level than PC.ring by more privileged rings to use pointers with ring tags (192..199). A reference using a non-ring pointer might cause an exception, making it difficult to accidentally trick privileged rings to use untrusted data.
To illustrate the utility of rings, the following example shows how all 8 rings might be used. Indeed, if there were one more ring available, it might be used for the user-mode dynamic linker, so that links are readable by applications, but not writeable.
What | R1,R2,R3 | seg RWX |
R b | W b | X b | G b | Ring 0 | Ring 1 | Ring 2 | Ring 3 | Ring 4 | Ring 5 | Ring 6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
User code | 2,2,2 | R-X | [2,7] | - | [2,2] | - | ---- | ---- | R-X- | R--- | R--- | R--- | R--- |
User execute only | 2,2,2 | --X | - | - | [2,2] | - | ---- | ---- | --X- | ---- | ---- | ---- | ---- |
User stack or heap | 2,2,2 | RW- | [2,7] | [2,7] | - | - | ---- | ---- | RW-- | RW-- | RW-- | RW-- | RW-- |
User read-only file | 2,2,2 | R-- | [2,7] | - | - | - | ---- | ---- | R--- | R--- | R--- | R--- | R--- |
User return stack | 4,2,4 | RW- | [2,7] | [4,7] | - | - | ---- | ---- | R--- | R--- | RW-- | RW-- | RW-- |
Compiler library | 7,0,0 | R-X | [0,7] | - | [0,7] | - | R-X- | R-X- | R-X- | R-X- | R-X- | R-X- | R-X- |
Super driver code | 3,3,3 | R-X | [3,7] | - | [3,3] | - | ---- | ---- | ---- | R-X- | R--- | R--- | R--- |
Super driver data | 3,3,3 | RW- | [3,7] | [3,7] | - | - | ---- | ---- | ---- | RW-- | RW-- | RW-- | RW-- |
Super code | 4,4,4 | R-X | [4,7] | - | [4,4] | - | ---- | ---- | ---- | ---- | R-X- | R--- | R--- |
Super heap or stack | 4,4,4 | RW- | [4,7] | [4,7] | - | - | ---- | ---- | ---- | ---- | RW-- | RW-- | RW-- |
Super return stack | 6,4,6 | RW- | [4,7] | [6,7] | - | - | ---- | ---- | ---- | ---- | R--- | R--- | RW-- |
Hyper driver code | 5,5,5 | R-X | [5,7] | - | [5,5] | - | ---- | ---- | ---- | ---- | ---- | R-X- | R--- |
Hyper driver data | 5,5,5 | RW- | [5,7] | [5,7] | - | - | ---- | ---- | ---- | ---- | ---- | RW-- | RW-- |
Hyper code | 6,6,6 | R-X | [6,7] | - | [6,6] | - | ---- | ---- | ---- | ---- | ---- | ---- | R-X- |
Hyper heap or stack | 6,6,6 | RW- | [6,7] | [6,7] | - | - | ---- | ---- | ---- | ---- | ---- | ---- | RW-- |
Hyper return stack | 6,6,6 | RW- | [6,7] | [6,7] | - | - | ---- | ---- | ---- | ---- | ---- | ---- | RW-- |
Hyper gates for supervisor | 6,6,4 | R-X | [6,7] | - | [6,6] | [4,5] | ---- | ---- | ---- | ---- | ---G | ---G | R-X- |
TEE code | 7,7,7 | R-X | [7,6] | - | [7,7] | - | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
TEE data | 7,7,7 | RW- | [7,6] | [7,6] | - | - | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
Sandboxed JIT code | 1,0,0 | RWX | [0,7] | [1,7] | [0,1] | - | R-X- | RWX- | RW-- | RW-- | RW-- | RW-- | RW-- |
Sandboxed JIT stack or heap | 0,0,0 | RW- | [0,7] | [0,7] | - | - | RW-- | RW-- | RW-- | RW-- | RW-- | RW-- | RW-- |
Sandboxed non-JIT code | 1,1,1 | R-X | [1,7] | - | [1,1] | - | ---- | R-X- | R--- | R--- | R--- | R--- | R--- |
User gates for sandboxes | 2,2,0 | R-X | [2,7] | - | [2,2] | [0,1] | ---G | ---G | R-X- | R--- | R--- | R--- | R--- |
SecureRISC implements two levels of address translation, as in processors with hypervisor support and virtualization, but I have invented new terminology for the process because physical address is somewhat ambiguous in a two-level translation. Programs operate using local virtual addresses. These addresses are translated to a system virtual address in a mapping specified by guest operating systems. The guest operating systems consider system virtual addresses as representing physical memory, but actually these addresses are translated again by a system-wide mapping specified by the hypervisor to system interconnect addresses that are used in the routing of accesses in the system fabric. All ports on the system interconnect translate system virtual addresses to system interconnection addresses in local TLBs at the boundary into the system interconnect. This allows guest operating systems to transmit system virtual addresses directly to I/O devices, which may transfer data to or from these addresses, employing the system-wide translation at the port boundary.
Making the svaddr → siaddr translation system-wide is a somewhat radical simplification compared to other virtualization systems. Whether SecureRISC retains this simplification or adopts a more traditional second level translation is open at this point, but my intention is to see if the simplification can suffice. A system-wide mapping means that the hypervisor must give each supervisor unique system virtual addresses for its memory and I/O, and the supervisors must be prevented from referencing the system virtual addresses of the other supervisors via the protection mechanism. This requires that supervisors must not expect memory and I/O in fixed locations. The advantage of a single mapping is that a single 64‑bit svaddr is all that is required when communicating with I/O devices, rather than two 64‑bit addresses (i.e. a page table address and the address within the address space specified by the page table).
A further consequence of the system-wide svaddr translation is that there can be only one hypervisor in the system. In other systems, one could have multiple hypervisors running in parallel, each supporting different sets of supervisors. This generality is elegant, but I wonder how important it is in practice.
The following elaborates on the above:
63 | 61 | 60 | 48 | 47 | 0 | |||||||||
SG | SEG | fill | VPN | byte | ||||||||||
3 | 13 | 48−ssize | ssize−PS | PS |
63 | 48 | 47 | 0 | ||
region | byte address | ||||
16 | 48 |
63 | 0 | |
byteaddress | ||
64 |
63 | 50 | 49 | 6 | 5 | 3 | 2 | 0 | ||||
port | line | word | byte | ||||||||
14 | 44 | 3 | 3 |
71 | 64 | 63 | 0 | ||
tag | data | ||||
8 | 64 |
cliqueto refer to this usage of tags; the clique of memory and pointers must match on access. Cliqued pointers in memory use the tag to represent the allocation containing the pointer, and so different bits must be used to specify the pointer clique, reducing the address space size by 8 bits for such pointers (making only 256 segments addressable). SecureRISC has CLA64 and CSA64 instructions that decode cliqued pointers on load and encode them on store. Cliqued pointers do not need to be word aligned in memory. When a load or store instruction checks memory tags (i.e. when the AR base register memtag field is not 251), if the address is not word aligned and the access crosses a word boundary, then all accessed word tags must match.
71 | 64 | 63 | 0 | ||
mc | data | ||||
8 | 64 |
71 | 64 | 63 | 56 | 55 | 0 | |||
mc | ac | address | ||||||
8 | 8 | 56 |
Field | Width | Bits | Description |
---|---|---|---|
address | 56 | 55:0 | Byte address |
ac | 8 | 63:56 | Clique of addressed memory (0..231) |
mc | 8 | 71:64 | Clique assigned by allocator to memory containing the pointer (0..231) |
71 | 64 | 63 | 0 | ||
type | data | ||||
8 | 64 |
71 | 64 | 63 | 61 | 60 | 0 | |||
memtag | ring | size | ||||||
8 | 3 | 61 |
71 | 64 | 63 | 0 | ||
240 | integer | ||||
8 | 64 |
71 | 64 | 63 | 0 | ||
244 | float64 | ||||
8 | 64 |
71 | 64 | 63 | 0 | ||
0 | 0 | ||||
8 | 64 |
71 | 70 | 64 | 63 | 61 | 60 | 48 | 47 | 0 | |||||||
0 | SS | SG | segment | fill | byte address in segment | ||||||||||
1 | 7 | 3 | 13 | 48−SEGSIZE | SEGSIZE |
tag | SS | Size in Words | G | |||||||
---|---|---|---|---|---|---|---|---|---|---|
2:0 6:3 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ||
0..7 | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1 |
8..15 | 1 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 1 |
16..23 | 2 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 | 2 |
24..31 | 3 | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 | 4 |
32..39 | 4 | 64 | 72 | 80 | 88 | 96 | 104 | 112 | 120 | 8 |
40..47 | 5 | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240 | 16 |
48..55 | 6 | 256 | 288 | 320 | 352 | 384 | 416 | 448 | 480 | 32 |
56..63 | 7 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 64 |
64..71 | 8 | 1024 | 1152 | 1280 | 1408 | 1536 | 1664 | 1792 | 1920 | 128 |
72..79 | 9 | 2048 | 2304 | 2560 | 2816 | 3072 | 3328 | 3584 | 3840 | 256 |
80..87 | 10 | 4096 | 4608 | 5120 | 5632 | 6144 | 6656 | 7168 | 7680 | 512 |
88..95 | 11 | 8192 | 9216 | 10240 | 11264 | 12288 | 13312 | 14336 | 15360 | 1024 |
96..103 | 12 | 16384 | 18432 | 20480 | 22528 | 24576 | 26624 | 28672 | 30720 | 2048 |
104..111 | 13 | 32768 | 36864 | 40960 | 45056 | 49152 | 53248 | 57344 | 61440 | 4096 |
112..119 | 14 | 65536 | 73728 | 81920 | 90112 | 98304 | 106496 | 114688 | 122880 | 8192 |
120..127 | 15 | 131072 | 147456 | 163840 | 180224 | 196608 | 212992 | 229376 | 245760 | 16384 |
tag | SS | Size in Words | G | |||||||
---|---|---|---|---|---|---|---|---|---|---|
2:0 7:3 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ||
128..135 | 16 | 218 | 218+215 | 218+2×215 | 218+3×215 | 218+4×215 | 218+5×215 | 218+6×215 | 218+7×215 | 215 |
136..143 | 17 | 219 | 219+216 | 219+2×216 | 219+3×216 | 219+4×216 | 219+5×216 | 219+6×216 | 219+7×216 | 216 |
144..151 | 18 | 220 | 220+217 | 220+2×217 | 220+3×217 | 220+4×217 | 220+5×217 | 220+6×217 | 220+7×217 | 217 |
152..159 | 19 | 221 | 221+218 | 221+2×218 | 221+3×218 | 221+4×218 | 221+5×218 | 221+6×218 | 221+7×218 | 218 |
160..167 | 20 | 222 | 222+219 | 222+2×219 | 222+3×219 | 222+4×219 | 222+5×219 | 222+6×219 | 222+7×219 | 219 |
168..175 | 21 | 223 | 223+220 | 223+2×220 | 223+3×220 | 223+4×220 | 223+5×220 | 223+6×220 | 223+7×220 | 220 |
176..183 | 22 | 224 | 224+221 | 224+2×221 | 224+3×221 | 224+4×221 | 224+5×221 | 224+6×221 | 224+7×221 | 221 |
184..191 | 23 | 225 | 225+222 | 225+2×222 | 225+3×222 | 225+4×222 | 225+5×222 | 225+6×222 | 225+7×222 | 222 |
71 | 64 | 63 | 4 | 3 | 0 | |||
221 | doubleword address | 0 | ||||||
8 | 60 | 4 |
71 | 64 | 63 | 4 | 3 | 0 | |||
250 | doubleword count | 0 | ||||||
8 | 60 | 4 |
71 | 64 | 63 | 4 | 3 | 0 | |||
250 | − doubleword count | 0 | ||||||
8 | 60 | 4 |
71 | 67 | 66 | 64 | 63 | 61 | 60 | 48 | 47 | 0 | |||||
24 | ring | SG | segment | byte address | ||||||||||
5 | 3 | 3 | 13 | 48 |
71 | 67 | 66 | 64 | 63 | 3 | 2 | 0 | ||||
25 | ring | BB descriptor word address | 0 | ||||||||
5 | 3 | 61 | 3 |
71 | 64 | 63 | 61 | 60 | 0 | |||
223 | 0 | offset | ||||||
8 | 3 | 61 |
71 | 64 | 63 | 0 | ||
232 | Local virtual address | ||||
8 | 64 |
71 | 64 | 63 | 49 | 48 | 45 | 44 | 42 | 41 | 40 | 39 | 24 | 23 | 21 | 20 | 3 | 2 | 0 | ||||||||
251 | P | 0 | R | Z | L | T | TE | B | BE | ||||||||||||||||
8 | 15 | 4 | 3 | 1 | 1 | 16 | 3 | 18 | 3 |
The following gives an overview of the above. See CHERI Concentrate Section 6 for details, except for the ring number field, which is SecureRISC specific.
Field | Width | Bits | Description |
---|---|---|---|
BE | 3 | 2:0 | Bottom bits 2:0 or Exponent bits 2:0 |
B | 18 | 20:3 | Bottom bits 20:3 |
TE | 3 | 23:21 | Top bits 2:0 or Exponent bits 5:3 |
T | 16 | 39:24 | Top bits 18:3 |
L | 1 | 40 | Length bit 19 (exponent bit 6) or Sealed |
Z | 1 | 41 | Internal Exponent flag |
R | 3 | 44:42 | Ring number (SecureRISC specific) |
Perm | 15 | 63:49 | Permissions |
251 | 8 | 71:64 | Tag for CHERI Word 1 |
71 | 64 | 63 | 62 | 57 | 56 | 55 | 29 | 28 | 0 | ||||
CLIQUE | W | E | L | T | B | ||||||||
8 | 1 | 6 | 1 | 27 | 29 |
Field | Width | Bits | Description |
---|---|---|---|
B | 29 | 28:0 | Bottom bits 31..3 |
T | 27 | 55:29 | Top bits 29..3 |
L | 1 | 56 | Length bit 30 |
E | 6 | 62:57 | Exponent |
W | 1 | 63 | Write permission |
CLIQUE | 8 | 71:64 | Clique |
71 | 64 | 63 | 0 | ||
255 | data | ||||
8 | 64 |
As noted earlier, it is useful to provide tags for Common Lisp, Python, and Julia types, even when they are simply pointers to fixed-sized memory, and could theoretically use tags 1..128. This would consume perhaps 10 more tags, as illustrated in the following with the assumption that other types could employ the structure type or something like it (perhaps some of following could do so as well).
Tag | Lisp | Julia | Data use |
---|---|---|---|
0 | nil? | 0 | |
1..31 | simple-vector? | Tuple? | TBD (pointers with exact size in words) |
32..127 | ? (pointer with inexact sizes) | ||
128..191 | no dynamic typing use (Reserved) | ||
192..199 | no dynamic typing use (unsized pointer with ring) | ||
200..207 | Code pointer with ring | ||
208..220 | no dynamic typing use (Reserved) | ||
221 | simple-vector? | Tuple? | TBD (pointer to words with size header) |
222 | no dynamic typing use (Cliqued pointer in AR) | ||
223 | no dynamic typing use (Segment Relative) | ||
224 | CONS | Pointer to a pair | |
225 | Function | Pointer to a pair | |
226 | Symbol | Pointer to structure | |
227 | Structure | Structure? | Pointer to structure |
228 | Array | Pointer to structure | |
229 | Vector | Pointer to structure | |
230 | String | Pointer to structure | |
231 | Bit-vector | Pointer to structure | |
232 | CHERI-128 capability word 0 | ||
233 | no dynamic typing use (Reserved) | ||
234 | Ratio | Rational | Pointer to pair |
235 | Complex | Complex | Pointer to pair |
236 | Bigfloat | BigFloat | Pointer to structure |
237 | Bignum | BigInt | Pointer to structure |
238 | Int128 |
Pointer to pair, −2127..2127−1 tag 241 in word 0, tab 240 in word 1 |
|
239 | UInt128 |
Pointer to pair, 0..2128−1 tag 241 in both word 0 and word 1 |
|
240 | Fixnum | Int64 | −263..263−1 |
241 | UInt64 | 0..264−1 | |
242 | Character | Bool, Char, Int8, Int16, Int32, UInt8, Uint16, Uint32 |
UTF-32 + modifiers, subtype in upper 32 bits |
243 | no dynamic typing use (Reserved) | ||
244 | Float | Float64 | IEEE-754 binary64 |
245 | Float16, Float32 | subtype in upper 32 bits | |
246..249 | no dynamic typing use (Reserved) | ||
250 | no dynamic typing use (header/trailer words) | ||
251 | no dynamic typing use (CHERI word 1) | ||
252..253 | no dynamic typing use (BB descriptor) | ||
254 | no dynamic typing use (trap on load) | ||
255 | no dynamic typing use (trap on load or store) |
In addition to Lisp types, SecureRISC could define tags for other dynamically typed languages, such as Python. Tuples, ranges, and sets might be examples. Other types, such as modules, might use a general structure-like building block rather than individual tags, as suggested for Lisp above.
At times it can be useful to be able to execute untrusted code in an environment where that code has no direct access to the rest of the system, but where it can communicate with the system efficiently. Hierarchical protection domains (aka protection rings) provide an efficient way to provide such an environment. Imagine a web browser that wants to be able to download code from an untrusted source, perhaps use Just-In-Time Compilation to generate native code, and then execute to provide some service as part of displaying the web page. The downloaded code should not be able to access any files or the state of the user browser. For this scenario on SecureRISC, where ring 0 is the least privileged and ring 7 the most privileged (the opposite of the usual convention), the web browser might execute in ring 2, generate machine code to a segment that is writeable from ring 2, but only Read and Execute to ring 0, and then transfer to that ring 0 code. All rings share the same address space and TLB entries for a given process, but the ring brackets stored in the TLB change access to data based on the current ring of execution. Ring 0 would have access only to its code, stack, and heap segments, and nothing else. It would not be able to make system calls or access files, except indirectly by making requests to ring 2. The only access ring 0 would have outside of its three segments might be to call a limited set of gates in ring 2, causing a ring transition. Interrupts and such would be delivered to the browser in ring 2, allowing it to regain control if the ring 0 code does not terminate. The browser and the rest of the system is completely protected by the code executing in ring 0. Because ring 0 is a subset of the address space of ring 2, ring 2 has complete access to all the data in ring 0, but ring 0 has access only to the segments granted to it by ring 2. Ring 2 has the option to grow or not grow the code, heap, and stack segments of ring 0 as appropriate.
One goal for SecureRISC is to support languages, such as Lisp, Julia, Javascript, and Java, that rely on garbage collection (GC), as this eliminates many programming errors that introduce bugs and vulnerabilities. GC is the automatic reclamation of allocated memory by tracing all reachable allocations and freeing the remainder. GC needs to be both low overhead while meeting application response time requirements (e.g. by not pausing the application excessively). SecureRISC will achieve this by including features (described in subsequent sections) for generational GC and per-thread access restrictions to allow concurrent GC to be performed by one processor while another continues to run the application.
Allocation is done in areas, which are for user-visible segregation of different-purpose allocations to different portions of memory. Areas consist of 1-4 generations, each generation consisting of some data structures and many small independent incremental regions that are used to implement incremental GC. The purpose of the incremental regions is to bound the timing of certain GC operations making program delays not proportional to memory size but only to incremental region size. When the application program needs to access an incremental region that has not been processed, the application switches to process it immediately, and then proceeds. The incremental region is small enough that the delay in processing it is acceptable to application performance, but large enough that its overhead is not excessive. A group of incremental regions is called a macro region, and a generation might be one or more macro regions. Macro regions are further divided into those for small and large objects, which use different algorithms for their incremental regions.
The SecureRISC Garbage Collection (GC) terminology introduced so far is briefly summarized below:
New allocations are presumed to how short lifetimes until proven otherwise. Such allocations are ephemeral and done in a generation 0, which is reclaimed frequently. The ephemeral allocations store pointers to all generations, but have few pointers from longer-lived generations to the more ephemeral allocations. For efficiency, reclamation operates without scanning all older allocations. Over time as data remains live in the ephemeral generation for many reclamations, it may be moved to an older generation. To work correctly, pointers in older areas that point to recent ones need to be known and used as roots for recent area scans. The processor hardware helps this process by taking an exception when a pointer to a newer generation is first stored to location in an older generation; the trap handler can note the page being stored to and then continue. The translation cache access for the store will provide both the generation dirty level for the target page and the generation number of the target segment. For the data being stored, the tag indicates whether it is a pointer or not, and if so then the Segment Size Cache provides the generation number of the pointer being stored, and the translation cache provides the generation of the page being stored to. If the page generation is greater than the generation of the pointer being stored, an exception occurs. SecureRISC has support for 4 generations, with generation 0 being the most ephemeral and generation 3 being the least frequently reclaimed. Rather than storing the location of all pointers on a page to more recent generations, the trap might simply note which pages need to be scanned when GC happens later. Because words in memory are tagged, pages can be scanned later without concern that an integer might be interpreted as a pointer. With sufficient operating system sophistication, it is even possible that a page could be scanned prior to being swapped to secondary storage, to prevent it needing to be read back in during GC. After the first trap on storing a recent generation pointer to an older generation page, if only the page is noted for later scan, then the GC field in the PTE would typically be changed by the trap handler so that future stores to the page are not trapped.
Before describing the mechanisms for incremental GC, it is helpful to have a specific GC algorithm in mind. The next section presents the preferred algorithm. After the preferred algorithm, the details of per-thread access restriction for incremental GC are presented.
David Moon, architect of Symbolics Lisp Machines, kindly offered suggestions on Garbage Collection (GC). I have dubbed his algorithm MoonGC. He began by observing the following:
Compacting garbage collection is better than copying garbage collection because it uses physical memory more efficiently.
Compacting garbage collection is better than non-moving garbage collection and C-style heap allocation because it does not cause fragmentation.
First, divide objects into small and large. Large objects are too large to be worth the overhead of compacting, larger than a few times the page size. Large objects are allocated from a
heapand never change their address. The garbage collector frees a large object when it is no longer in use. By putting each large object on its own pages, physical memory is not fragmented and the heap only has to track free space at page granularity. Virtual address space gets fragmented, but it is plentiful so that is not a problem.Small objects are allocated consecutively in fixed-size
regionsby advancing a per-region fill pointer until it hits the end of the region or there is not enough free space left in that region; at that point allocation switches to a different region. The region size is a small multiple of the page size. The allocator chooses the region from among a set of regions that belong to a user-specifiedarea.Garbage collection will compact all in-use objects in a region to the low-address end of that region, consolidating all the free space at the high-address end where it can easily be reused for new allocations.
SecureRISC now uses incremental region
for what MoonGC called
simply region
above. Before continuing, this proposal introduces
this and other terminology in the next section.
One other advantage of compaction, not mentioned above, is that it provides a natural mechanism for determining the long-lifetime data in ephemeral generations: it is the data compacted to the lowest addresses.
MoonGC, as originally presented, is a four phase algorithm to implement the above using only virtual memory and changing page permissions. The following adapts MoonGC to take advantage of the address restriction feature described below, as using virtual memory protection changes is costly. The restriction allows GC to deny application threads access to incremental regions when they are in an inconsistent state. The following also makes other minor changes so that the exposition below is new. The credit goes to David Moon, but problems and bugs are likely the result of these changes and exposition.
The application threads run concurrently with the GC threads, except in phase 3 (the stack scan). Application threads may be slowed during phase 4 as will be explained. The four phases of MoonGC are as follows:
Occasionally an extra phase of the algorithm might compact two incremental regions into one. Still additional phases might migrate objects from a frequent generation to a less frequent one.
This proposal starts with the assumption that software will designate one or more macro regions of the virtual address space to be subject to additional access control for rings ≤ R (controlled by a privileged register so that, for example, user mode cannot affect supervisor mode). For example, when Garbage Collection is used for reclaiming allocated storage, only the heap might be subject to additional protection to implement read or write barriers. These macro regions of the virtual address space are specified using a Naturally Aligned Power-of-2 (NAPOT) matching mechanism to provide flexible sizing. Matching for eight macro regions is currently proposed, which would support four generations of small object macro regions, and four generations of large object regions. This restriction is implemented in a fully associative 8‑entry CAM matching the effective address of loads and stores. A match results in 128 access restriction bits, with one bit selected by the address bits below the match. In particular, there are eight Virtual Access Match registers (amatch0 to amatch7), eight 128‑bit Virtual Address Region Trap registers (atrap0 to atrap7), and eight 128‑bit Virtual Address Region Write Trap registers (awtrap0 to awtrap7). The atrapi/awtrapi registers are read and written 64 bits at a time using low and high suffixes, i.e. atrapil/atrapih and awtrapil/awtrapih. The format of the amatchi registers is as follows, using a NAPOT encoding of the bits to compare when testing for a match.
63 | 18 | 17 | 3 | 2 | 0 | ||||||
vaddr63..19+S | 2S | 0 | TYP | ||||||||
45−S | 1+S | 15 | 3 |
Field | Width | Bits | Description |
---|---|---|---|
TYP | 3 | 2:0 |
0 ⇒ Disabled 1 ⇒ Address restriction for GC 2..7 Reserved |
2S | 1+S | 18+S:18 | NAPOT encoding of virtual address region to match |
vaddr63..19+S | 45−S | 63:19+S | Virtual address to match |
When bits 63:19+S of a virtual address match the same bits of amatchi, then the corresponding atrapil/atrapih and awtrapil/awtrapih pairs specify 128 additional access and write denial bits for the incremental regions thereof. In particular, on a match to amatchi, bits 18+S:12+S of the effective address are used to select bits from the atrapi pair and the awtrapi pair. If the atrapi bit is 1, then loads and stores generate an access fault; else if the awtrapi bit is 1, then only stores generate an access fault. The value of S comes from the NAPOT encoding of amatchi registers, as the number zero bits starting from bit 18 (i.e., S=0 if bit 18 is 1, S=1 if bits 19:18 are 10, and so on). Setting bits 63:18 to 245 causes it to match the entire virtual address space. The lowest numbered amatchi match has priority. If no amatchi register matches then there is no additional access check.
How to control ring access to the above CSRs is TBD, as what ring accesses are trapped.
A atest instruction will be specified to return the incremental region that matches the effective address ea. If there is not a match, these instructions return the null pointer (tag 0). On a match to amatchi they return a pointer (with the appropriate size tag) to ea63..12+S∥012+S based on the S from the matching register.
awtrapi registers are not required for MoonGC, described above, and may be left set to zero for that algorithm. They could be omitted if another use is not found for them, but they may be useful for other GC algorithms.
The efficiency of translating pre-compaction to post-compaction addresses is critical. The original MoonGC proposal recognized that this time is probably limited by data cache misses, and used the preparation phase to convert the bitmaps into a relocation tree that would require only three cache block accesses per translation with binary searching. The following modification is proposed to reduce this to just two cache blocks by making extensive use of population count (popcount) operations.
Within a small object incremental region, the post-compaction offset of
an object is the number of mark bits set in the incremental region
bitmap for objects up to but not including that object. For
translation, summing the popcount on all the words in the bitmap prior
to the word for the pre-compaction address would touch too many cache
blocks, so in phase 2 (preparation) compute the popcounts of each bitmap
cache block and store them for lookup in phases 3 and 4. Each
translation is then one popcount cache block access and one bitmap cache
block access. For a small object incremental region
holding N
objects and a cache block size of 512 bits
(64 B), the number of bitmap cache blocks B
is ⌈N/512⌉
. Store 0
in summary[0]
; store popcount(bitmap[0..511])
in summary[1]
;
store summary[1]+popcount(bitmap[512..1023])
in summary[2]
; and so on … and finally
store summary[B-2]+popcount(bitmap[N-1024..N-511])
in summary[B-1]
.
If N ≤ 65536 then the
summary count array elements fit in 16 bits, and so the size of the
summary array is ⌈B/32⌉
cache blocks, and
if N ≤ 16384 the summary
array fits in only one cache block. To translate from the
pre-compaction offset to the post-compaction offset in phases 3 and 4,
simply take the ⌊offset/512⌋
as the index into
this array to get the number of objects before the bitmap cache block.
Now read the bitmap cache block. Add the popcount of the 1-8 words up
to the object of interest (using a mask on the last word read) to the
lookup value. This sum is the post-compaction offset in the small
object incremental region. If eight popcounts are too costly, then the
summary popcount array may be doubled in size to cover just four words,
or a vector popcount reduction instruction might be added to make this
even more efficient.
As an example, to illustrate the above, consider NAPOT matches on 16 MiB (S=5), which provides 128 access controlled incremental regions of 128 KiB (131072 B) each. An object pointer is converted to its containing incremental region by simply clearing the low 17 bits. There are 16104 words (2013 cache blocks) of object store (98.29%), which are stored starting at offset 0 in the incremental region. The bitmap summary popcounts are 64 bytes starting at 128832. Bitmaps are 2016 bytes (31.5 cache blocks) starting at 128896. Finally there are 160 bytes (20 words, 2.5 cache blocks) of incremental region overhead for locks, freelists, etc. available starting at 130912. To go from the pointer to its bitmap byte, add bits 16:6 to the region pointer plus 128896 and the bit is given by bits 5:3.
Mregion | Iregion | Objects | Summary | Bitmap | Other | |||||
---|---|---|---|---|---|---|---|---|---|---|
S | MiB | words | words | % | words | % | words | % | words | % |
0 | 0.5 | 512 | 480 | 93.75 | 1 | 0.20 | 8 | 1.56 | 23 | 4.49 |
1 | 1 | 1024 | 984 | 96.09 | 1 | 0.10 | 16 | 1.56 | 23 | 2.25 |
2 | 2 | 2048 | 1992 | 97.27 | 1 | 0.05 | 32 | 1.56 | 23 | 1.12 |
3 | 4 | 4096 | 4008 | 97.85 | 2 | 0.05 | 63 | 1.54 | 23 | 0.56 |
4 | 8 | 8192 | 8040 | 98.14 | 4 | 0.05 | 126 | 1.54 | 22 | 0.27 |
5 | 16 | 16384 | 16104 | 98.29 | 8 | 0.05 | 252 | 1.54 | 20 | 0.12 |
6 | 32 | 32768 | 32232 | 98.36 | 16 | 0.05 | 504 | 1.54 | 16 | 0.05 |
7 | 64 | 65536 | 64480 | 98.39 | 32 | 0.05 | 1008 | 1.54 | 16 | 0.02 |
8 | 128 | 131072 | 128976 | 98.40 | 63 | 0.05 | 2016 | 1.54 | 17 | 0.01 |
9 | 256 | 262144 | 257968 | 98.41 | 126 | 0.05 | 4031 | 1.54 | 19 | 0.01 |
10 | 512 | 524288 | 515952 | 98.41 | 252 | 0.05 | 8062 | 1.54 | 22 | 0.00 |
11 | 1024 | 1048576 | 1031928 | 98.41 | 504 | 0.05 | 16124 | 1.54 | 20 | 0.00 |
Smaller incremental regions may provide better real-time response, but limit the size of a macro region due to the 128 access denial bits provided by atrapi pairs. Larger incremental regions pause the application for longer and also require a larger summary popcount array, but allow for larger memory spaces. Generations might choose different incremental region sizes. Typically generation 0 (the most ephemeral) would use small incremental regions, while generation 3 (the most stable) would use incremental regions sized to fit the amount of data required.
With eight amatch sets of registers, half might be used for four generations of small object regions, and half for four generations of large object regions. In the above example, if each bit of atrap controls a 128 KiB small object region, then the ephemeral generation can be as large as 16 MiB. Less ephemeral generations might be larger.
A possible improvement to the algorithm is to have areas use slab allocation for a few common sizes. For example, there might be separate incremental regions for 1, 2, …, 8, and >8‑word objects. This allows a simple freelist to be used for objects ≤8 words so that compaction is not required on every GC. Incremental regions for ≤8 words might only be compacted when it would allow pages to be reclaimed or cache locality to be increased. Note that different tradeoffs may be appropriate for triggering compaction in ephemeral vs. long-lived generations. Also, bitmaps could be potentially use only one bit per object rather than one bit per word in 2‑word, 4‑word, and 8‑word regions, making these even more efficient. However, that requires a more complicated mark and translation code sequence.
When a GC thread finishes compaction of an incremental region, application access is not immediately enabled since that would require sending an interrupt to all the application threads telling them to update their atrap registers. Instead the updated atrap bits are stored in memory, and the next application exception will load the updated value before testing whether compaction is required, in progress, or still needs to be done.
Setting the TYP to 0 in amatchi registers may be used by operating systems to reduce context switch overhead; disabled registers may be treated as having amatchi/atrapi/awtrapi all zero.
This section is very preliminary at this point.
Each ring is capable of handling some of its own exceptions and interrupts. For example, ring N assertion failures (attempts to writes of 1 to b0) turn into a call to the ring N handler. This exception call pushes the PC, the offset in the basic block, plus three words of additional information onto the Call Stack, and a return pops this information. The exception handler is specified in a per ring register. The additional information includes a cause code and may include further information that is cause dependent. The details of the exception mechanism are TBD. Of course, in some cases exceptions should be handled by a more privileged ring (e.g. user page faults should go a supervisor exception handler since the user exception handler might take a page fault, and similarly for second-level page faults for the supervisor and hypervisor). Again details TBD. Also, exceptions in exception handlers may go to a higher ring.
The Basic Block Descriptor (BBD) addresses of the exception handlers for
ring R are given by the per-ring
CSRs ExceptionHandler[R]. As with other
per-ring registers, only rings of equal or higher privilege may write
the register. In addition, values written to this register must have a
code pointer tag designating a ring of equal or higher privilege, but
not higher privilege of PC.ring. Thus
the validity test is as follows:
h ← X[a]
if (R > PC.ring) | (h71..67 ≠ 25) | (h66..64 < R) | (h66..64 > PC.ring) then
exception
endif
In addition the addressed BBD must have tag 252
with prev = 4 (Cross-Ring Exception entry)
or prev = 12 (Same-ring Exception entry).
ExceptionHandler[R] specifies the BBD address for exceptions from less privileged rings to ring R (i.e. for PC.ring < R). Exceptions from ring R to R (i.e. for PC.ring = R) use the modified BBD address ExceptionHandler[R] | 64. This allows cross-ring exceptions to additional state save and restore (e.g. stack switching), while same-ring exceptions are fast (and for example stay on the same stack).
The exception process may itself encounter an exception that must be serviced by a more privileged ring (e.g. a virtual memory exception in writing the call stack). This will be designed so that after the virtual memory exception is remedied, the lower privilege ring exception can proceed. Also, programming or hardware errors might result in an attempt to take an exception in the critical portion of the exception handler, which will be detected, and signal a new exception to a more privileged ring, or a halt in the case of ring 7.
SecureRISC could provide an instruction to push an AR pair and an XR pair onto the Call Stack rather than providing per-ring scratch registers. However, some sort of way of loading new values for these registers to give the exception handler the addresses it needs to save further state is still required. It is unlikely that using an absolute address is acceptable.
Each ring has its own set of interrupt enable and pending bits, and these are distinct from other rings’ bits. Interrupts also use the general exception handler, with a cause code indicating the reason is for an interrupt. Their additional information includes the previous InterruptEnable mask for the target ring. When the interrupt exception occurs, InterruptEnable[ring] is cleared, automatically cleared, including the interrupt being taken, and the original interrupt enables are saved on the Call Stack. The interrupt handler is expected to reenable higher-priority interrupts based on clearing same and lower priority interrupts from the saved enables and writing that back to InterruptEnable[PC.ring]. The bits from the saved enables to clear might be a bitmask from a per-thread lookup table which allows for all 64 interrupts to be prioritized relative to each other.* The RFI instruction restores the interrupt enable bits from the Call Stack. Any pending interrupts that are thereby enabled will be taken before executing the instruction returned to. The RFI instruction may optimize this case by simply transferring back to the handler address rather than popping and pushing the call stack.
* Using a per-interrupt mask of same and lower-priority interrupts is very general and allows for all 64 interrupts to be prioritized relative to each other. However, rather than clearing the ring’s InterruptEnable, which temporarily blocks high-priority interrupts, it would be possible to do the new InterruptEnable computation in hardware as part of the process of taking the interrupt, but this requires a per-ring 64×64 SRAM to specify lower priority interrupts per taken interrupt, and this is a lot to context switch. If it is required, it would instead be possible to provide a per-ring 64×4 SRAM (256 bits to context switch) giving a 4‑bit interrupt priority to each interrupt, and use that to calculate a new InterruptEnable when taking an interrupt. Sixteen priority levels should be sufficient. However, this would require a new RICSR type to be able to read/write 256 bits per-ring, and so this would only be done if it proves necessary.
Interrupt pending bits are set by writing to a memory address specific to the target process. When the process is running, this memory address is redirected to the process’ pending register; otherwise, it will receive the interrupt when it switches to running.
The mechanism for clearing an interrupt pending bit is interrupt dependent. For level-triggered interrupts it is interaction with the interrupt signal source that deasserts the signal, and thus clears the pending bit. For edge-triggered and message-signalled interrupts, the RCSRRCXC instruction may be used clear the interrupt pending bit.
Processors check for interrupts at the start of each instruction. An interrupt is taken instead of executing the instruction if (InterruptPending[ring] & InterruptEnable[ring]) ≠ 0 with the check done in order for ring from 7 down to PC.ring.
Two interrupts are generated by the processor itself and are assigned fixed bits in the InterruptPending and InterruptEnable registers. Bit 0 is for the ICount interrupt, and bit 1 is for the CCount interrupt. In addition, the supervisor is expected to define certain interrupts for user rings. For example, a timer interrupt would typically be created from cycle counts for bit 2.
Interrupts need to be virtualized. SecureRISC expects systems to primarily employ Message Signaled Interrupts (MSIs), where interrupts are transactions on the system interconnect. MSIs are directed to a specific process. If the process is currently executing on a processor, then the interrupt goes to that processor. If the process is not running, then the interrupt must be stored in memory structures (e.g. by setting a bit), and then the scheduler for that process must be notified (e.g. by an interrupt message). When a process is scheduled on a processor, the interrupts stored in memory are loaded into the processor state, and future interrupts are directed to the processor rather than to memory.
To implement this, interrupt messages are directed to one or more specialized Interrupt Processing Units (IPUs). Creating a process allocates system interconnect memory for the process’ interrupt data structures and provides this memory to the chosen IPU. When the process is scheduled, the IPU is informed to forward interrupts directly to it. When a process is descheduled, the IPU is informed to store its interrupts in the allocated memory and send an interrupt to the scheduler.
For some systems a single Interrupt Processing Unit (IPU) may be sufficient. In others it may be appropriate to have multiple IPUs, e.g. one unit per major priority level, so that lower priority interrupts do not impede the processing of higher priority ones. (There may be some sequential processing in IPUs, such as a limitation on outstanding memory operations.) NUMA systems may also want distributed IPUs.
The details of the above are TBD. Conceptually, MSIs would probably address a process through a triple of Interrupt Processor Unit (IPU) number, an opaque identifier referencing a process, and an interrupt number for the process. The opaque identifier would be translated to its associated memory by the IPU, and the interrupt number bounds checked against the number of interrupts configured for the process. Forwarding interrupts to running processes would specify a processor port on the system interconnect, a ring number, and the interrupt number. It may be desirable to fit the interrupt state for a process into a single cache line to help manage atomic transfers between IPUs and processors.
The advantage of this outline is that not specialized storage is required per process. Main memory replaces the specialized storage for non-running processes, and the processor interrupt mechanisms are used for running processes.
Most RISC ISAs use a set of mechanisms to implement dynamic loading and dynamic linking that are less efficient than what SecureRISC can do using tags and a different ABI. Because the RISC-V ABI for dynamic linking is slightly better than some older ABIs, it will be the basis of comparison here.
Most dynamic linking implementations do lazy linking on procedure calls; the first call to a procedure invokes the dynamic linker, which converts the symbol being referenced into an address and arranges that subsequent calls go directly to the specified address. This speeds up program loading because symbol lookup is somewhat expensive, and not all symbols need to be resolved in every program invocation. Lazy linking is not typically done for data symbols because the cost of detecting the first reference and invoking the dynamic linker would require too much extra code at every data access. So data symbol links are typically resolved when a shared library is loaded. In contrast, SecureRISC’s trap on load tag (tag value 255), allows links to external data symbols to be resolved on first reference, which should lead to faster execution initiation.
In RISC‑V, because external symbols are resolved by the dynamic linker when the object is dynamically loaded, it suffices to reference indirect through the link filled in by the linker, which is stored in a section called the Global Offset Table (GOT). In RISC‑V the GOT is a copy-on-write section of the mapped file and is addressed using PC-relative addressing (using the RISC‑V AUIPC instruction).
External symbol and function references are given in the C++, RISC‑V, and SecureRISC code examples below to illustrate the differences between the RISC‑V ABI and the proposed SecureRISC ABI. Starting with the the C++ code:
extern uint64_t count; | // external data | ||
extern void doit(void); | // external function | ||
static void | |||
count_doit(void) | |||
{ | |||
count += 1; | |||
doit(); | |||
} | |||
could be implemented as follows for RISC‑V: | |||
count_doit: | |||
addi | sp, sp, -16 | // allocate stack frame | |
sd | ra, 0(sp) | // save return address | |
.Ltmp: | auipc | t0, %got_pcrel_hi(count) | // load link to count from GOT |
ld | t0, %pcrel_lo(.Ltmp)(t0) | // (PC-relative) | |
ld | t1, (t0) | // load count | |
addi | t2, t1, 1 | // increment | |
sd | t2, (t0) | // store count | |
call | doit@plt | // call doit indirectly through PLT | |
ld | ra, 0(sp) | // restore return address | |
ret | // return from count_doit | ||
where the call pseudoinstruction above is initially: | |||
auipc | ra, 0 | // with relocation R_RISCV_CALL_PLT | |
jalr | ra, ra, 0 | ||
but potentially relaxedto: |
|||
jal | ra, 0 | // with relocation R_RISCV_JAL | |
when the PLT is within the 1 MiB reach of the JAL (see RISC-V ABIs Specification version 1.0). | |||
The PLT target of the above AUIPC/JALR or JAL is a 16‑byte stub with the following contents: | |||
1: | auipc | t3, %pcrel_hi(doit@.got.plt) | |
ld | t3, %pcrel_lo(1b)(t3) | ||
jalr | t1, t3 | ||
nop |
As seen above, the external variable reference is three instructions initially (and subsequently just one, as long as the link is held in the register). The SecureRISC ABI generally requires only two instructions to the first reference.
Also as seen above, the external procedure call is 4-5 instructions with two changes of instruction fetch (two BTB entries), one in the body and one in the PLT. If there are multiple calls to doit in the library, the PLT entry is shared by all the calls. When the number of frequent calls to doit is N, then N+1 BTB entries are required (N from the body, 1 from the PLT). The SecureRISC ABI requires 2 instructions and N BTB entries, which is not significantly different from N+1 for large N, but for N=1 represents half the BTB entries.
The typical POSIX ABI, such as the RISC‑V ABI, is as based on the
C/C++ notion that all functions are top level
. Other languages
allow functions nesting, and is typically implemented by making function
variables two pointers: a pointer to the code to call, and a context
pointer
specifying the stack frame of the function’s parent,
which is used when referencing the parent’s local variables. The
SecureRISC ABI proposal is to adopt the idea that all functions are
specified by a code and context pointer pair, where the context for
top-level functions is a pointer to their global variables and function
links.
One the consequences of the proposed SecureRISC ABI is that copy-on-write is not required. An operating system that implements copy-on-write could use it (the context pointer would point to the data section of the mapped file), but it might avoid copy-on-write by copying the mapped file’s data template to a data segment with read and write permission, which allows page table sharing for the mapped file.
Another consequence is that the method of access to globals and links is the same in both the main application code and dynamically loaded shared libraries. In RISC‑V and other ABIs, the application code typically references global variables via the Global Pointer (gp) register, but with PC-relative references in shared libraries. For SecureRISC, each shared library has a register (the context pointer) for addressing its top-level data.
The C++ function above could be implemented on SecureRISC as follows (with shading used to highlight the basic blocks): | |||
count_doit: | |||
bb | %prcall,%icall|%nohist | ||
entryi | sp, sp, 32 | // allocate stack frame | |
sadi | sp, sp, 0 | // save return address | |
sadi | a10, sp, 16 | // save a10 | |
mova | a10, a1 | // move context to a10 | |
lai | a2, a10, count_link | // load count link | |
lsi | s0, a2, 0 | // load count | |
addsi | s1, s0, 1 | // increment | |
ssi | s1, a2, 0 | // store count | |
ljmpi | a0, a10, doit_link+0 | // doit code pointer | |
lai | a1, a10, doit_link+8 | // doit context pointer | |
bb | %preturn,%return | ||
ladi | a10, sp, 16 | // restore saved register | |
ladi | sp, sp, 0 | // restore stack pointer |
Note that the LJMPI is a load instruction that checks the call prediction performed by the fetch engine when the BB descriptor at the start of count_doit is processed; it does not end the basic block.
Various checks are performed on all load and store instructions:
SecureRISC, as originally conceived, was simply going to specify its memory model as Release Consistency), but after encountering RISC‑V, it seemed wise to look at what had been done there for memory model specification, so this is on hold. This section will be expanded when the memory model is defined.
The following overview is meant to give a general framework to help the reader appreciate the details presented subsequently.
The SecureRISC Instruction Set is designed around six register files, two intended for use early in the pipeline, and four later in the pipeline. While some implementations may not have an early/late distinction, they are described this way here to indicate the possibility of such a split.
Name(s) | Description | Comment | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Early Pipeline |
These instructions have at most three register operands, at most
two sources and one destination except stores, which have up to
three register sources, but never more than two sources from a
given register file. Operations are grouped into classes represented by schemas for conciseness in instruction exposition:
|
||||||||||||||||||||||
AR/AV | Address Registers |
Used as base address for load and store instructions, where the
effective address calculation is
either AR[a] + (XR[a]<<sa) where sa is 0..3, or AR[a] + immediate. The single-bit AVs are valid bits for speculative load validity propagation. |
|||||||||||||||||||||
XR/XV | Index Registers |
Used as for integer calculations related to addressing.
Often the general non-memory format is: XR[d] ← XR[a] xa XR[b] where xa is a fairly simple operation (e.g. + or <<u). The single-bit XVs are valid bits for speculative load validity propagation. |
|||||||||||||||||||||
Late Pipeline |
These instructions have up to three sources and one destination.
SecureRISC makes use of the three source operands more than other
ISAs. Often the general format is: RF[d] ← RF[c] accop (RF[a] op RF[b]) where accop is an accumulation operation (e.g. + or −) and op is more general operation (e.g. ×). Operations are grouped into classes represented by schemas for conciseness in instruction exposition, and most classes have an associated accumulation operation schema:
|
||||||||||||||||||||||
BR/BV | Boolean Registers | Use for comparisons, selection, and branches on scalar registers. | |||||||||||||||||||||
SR/SV | Scalar Registers | Used for both integer and floating-point scalar calculations. Not associated with address calculation. | |||||||||||||||||||||
VR | Index Registers | Used for both integer and floating-point vector calculations. VRs have no associated valid bits and are typically not renamed. | |||||||||||||||||||||
VM | Mask Registers | Used for both integer and floating-point vector masking. VMs have no associated valid bits and are typically not renamed. |
Access to the state of more privileged rings is prohibited. For example, attempting to read or write CSP[ring] when ring > PC.ring causes an exception. Unprivileged state (e.g. the register files) may be accessed by any ring.
In the table below, the Type field values CSR, RCSR (per-ring CSRs), and
ICSR (indirect CSRs) are described
in Control and Status Register Operations.
The type R
is used for a simple register, and RF
for
a Register File,
and VRF
for a Vector Register File.
The user process state includes:
Name | Type | Depth | Width | Read ports | Write ports | Description |
---|---|---|---|---|---|---|
PC | R | 1 | 3 + 61 + 5 |
The Program Counter holds the current ring number (3 bits), Basic Block descriptor address (61 bits—word aligned), and 5‑bit offset into the basic block of the next instruction. The 5‑bit offset is only visible on exceptions and interrupts. When compressed BBDs (tag 253) are defined, this will be 62 bits and 32‑bit aligned. | ||
CSP | RCSR | 8 | 3 + 61 | The Call Stack Pointer holds the ring number and address of the return address stack maintained by call and return basic blocks, and by exceptions and interrupts. This is not the same as the Program Stack Pointer, which is held in an AR designated by the Software ABI. There is one CSP per ring. | ||
ThreadData | RCSR | 8 | 72 | Thread Data is a per-ring storage location for a pointer to Thread-Local Storage (TLS). Functions that require access to per-thread data typically move this to an AR. It is also typically used in cross-ring exception handlers to save and restore the registers that ring requires to handle exceptions. | ||
ExceptionHandler | RCSR | 8 | 3 + 61 | ExceptionHandler[ring] holds the ring number and address to which the processor redirects execution on an exception for that ring. | ||
InstructionCount | RCSR | 8 | 64 | InstructionCount[ring] holds the count of executed instructions in each ring. | ||
BBCount | RCSR | 8 | 64 | BBCount[ring] holds the count of executed Basic Blocks in each ring. | ||
ICountIntr | RCSR | 8 | 64 | The ICount bit in InterruptPending[PC.ring] is set when (InstructionCount[PC.ring] − ICountIntr[PC.ring]) > 0. This may be used for single stepping. | ||
CycleCount | RCSR | 8 | 64 | CycleCount[ring] holds the number of cycles executed by ring. | ||
CCountIntr | RCSR | 8 | 64 | The CCount bit in InterruptPending[PC.ring] is set when (CycleCount[PC.ring] − CCountIntr[PC.ring]) > 0. | ||
InterruptEnable | RCSR | 8 | 64 | InterruptEnable[ring] holds interrupt enable bits for each ring. Interrupts for each ring are distinct. Application rings are expected to use the interrupts for inter-process communication. Supervisor and hypervisor rings will also use interrupts for communication with I/O devices. | ||
InterruptPending | RCSR | 8 | 64 | InterruptPending[ring] holds interrupt pending bits for each ring. | ||
AccessRights | RCSR | 8 | 12 | AccessRights[ring] holds the current Mandatory Access Control Set per ring. It is writeable only by ring 7. These rights are tested against the MAC level of svaddr regions specified in the Region Protection Table and potentially by the System Interconnect. | ||
ALLOWQOS | RCSR | 8 | 6 | ALLOWQOS[ring] holds the minimum value that may be written to QOS by a ring. Rings may not write values to ALLOWQOS[ring] less than ALLOWQOS[PC.ring]. | ||
QOS | RCSR | 8 | 6 | QOS[ring] holds the current Quality of Service (QoS) identifier per ring. QoS identifiers are used on system interconnect transactions for prioritization. Rings may only set QOS to values allowed by ALLOWQOS[PC.ring]. Attempts to write smaller values trap. | ||
KEYSET | CSR | 1 | 16 | This register is writeable only by ring 7, and specifies which encryption key indexes are currently usable. A reference to a disabled key in the ENC field of the Region Descriptor Table causes an exception. This allows ring 7 to partition the system based on which encryption keys are usable. | ||
ENCRYPTION | ICSR | 15 | 8 + 256 | These registers are readable and writeable only by ring 7, and provide the 8‑bit encryption algorithm and 256‑bit encryption key for main memory encryption as specified in Region Descriptor as an index into this array. The encryption algorithm and key are selected by the ENC of the Region Descriptor Table, with 0 being hardwired to no-encryption. Up to 15 pairs may be specified, but some implementations may support a smaller number. This is further defined in Memory Encryption below. | ||
AMATCH | ICSR | 8 | 64 + 128 | These registers are described in Virtual Address Restriction. | ||
AR | RF | 16 | 144 | 2 | 1 |
The Address Register file holds pointers and integers to perform
calculations related to control flow and to load and store address
generation. No AR is hardwired
to 0. Bits 63..0 are address or data (bits 135..133 are the ring
number if address), bits 71..64 are the tag, and bits 132..72 are
the size expanded from the tag, or as written by
the WSIZE instruction, and bits
143..136 are used for the expected memory tag for cliqued
pointers, or are the value 251 for other pointers. In some microarchitectures, operations on ARs are executed in the early pipeline, either speculatively or non-speculatively. (Late pipeline operations may be queued until non-speculative or may be speculatively executed as well.) Most instructions that read ARs read only AR[a]. When two ARs are read, it is sometimes using the b field and sometimes the c field (AR stores read AR[c] and a few branches and SUBXAA read AR[b]). The b/c multiplexing can be done during instruction decode. |
AV | RF | 16 | 1 | 1 | 1 | The Address Register Valid file holds valid bits from speculative loads and propagation therefrom. |
XR | RF | 16 | 72 | 2 | 1 |
The Index Register file holds integers to perform
calculations related to control flow and to load and store
address generation.
No XR is hardwired to 0.
Bits 63..0 are data and bits 71..64 are the tag. The XR
primarily holds integer-tagged data, but other tags may be
loaded. In some microarchitectures, operations on XRs are executed in the early pipeline, either speculatively or non-speculatively. (Late pipeline operations may be queued until non-speculative or may be speculatively executed as well.) The XR register file requires two read ports and one write port per instruction. Most instructions that read XRs read XR[a] and XR[b], but XR stores read XR[b] and XR[c]. The b/c multiplexing can be done during instruction decode. |
XV | RF | 16 | 1 | 2 | 1 | The Index Register Valid file holds valid bits from speculative loads and propagation therefrom. |
SR | RF | 16 | 72 | 3 | 1 |
The Scalar Register file holds data for computations not involved
in address generation and primarily holds integer or
floating-point values. Tags are stored, and
so SRs may be used for copying
arbitrary data, including pointers, but no instruction
uses SRs as
an address (e.g. base) register. Integer operations check for
integer tags, and floating-point operations check for float tags.
No SR is hardwired to 0. In some microarchitectures, operations on SRs occur later in the pipeline than operations on ARs, separated by a queue, allowing these operations to wait for data cache misses while the AR engine continues to move ahead generating addresses. When multiple functional units operate in parallel, only some will support 3 source operands, with the others only two. The most important instructions with three SR source operands are multiply/add (both integer and floating-point), and funnel shifts. The three SR read ports handle the a, b, and c register specifier fields, with writes specified by the d register field. SRs are late pipeline state. |
SV | RF | 16 | 1 | 3 | 1 | The Scalar Register Valid file holds valid bits from speculative loads and propagation therefrom. SVs are late pipeline state. |
BR | RF | 16 | 1 | 3 | 1 | Boolean Registers hold 0/False or 1/True, such as the result of comparisons and logical operations on other Boolean values. BRs are typically used to hold SR register comparisons and may avoid branch prediction misses in some algorithms. BR[0] is hardwired to 0. Attempts to write 1 to BR[0] trap, which converts such instructions into negative assertions. BRs are late pipeline state. |
BV | RF | 16 | 1 | 3 | 1 | The Boolean Register Valid file holds valid bit propagation from speculative loads (primarily SR loads). Branches with an invalid BR operand take an exception. BVs are late pipeline state. |
CARRY | RF | 1 | 64 |
The CARRY register is used on multiword
arithmetic (addition, subtraction, multiplication, division, and
carryless multiplication). See below. Consider expansion of CARRY to a 4-entry register file (c0 to c3). CARRY is late pipeline state. |
||
VL | CSR | 1 | 64 | The Vector Length register specifies the length of vector loads, stores, and operations. VL is late pipeline state. | ||
VSTART | CSR | 1 | 7 | The Vector Start register is used to restart vector operations after exceptions. Details to follow. VSTART is late pipeline state. | ||
VM | RF | 16 | 128 | 3 | 1 |
The Vector Mask register file stores a bit mask for elements of
vector operations. VM[0] is
hardwired to all 1s and is used for unmasked operations. Only VM[0] to VM[3] may be specified for masking vector operations in 32-bit instructions. VM[4] to VM[15] are available for vector comparison results and Boolean operations and in 48‑bit and 64‑bit formats. VMs are late pipeline state. |
VR | VRF | 16 | 72 × 128 | 3 | 1 | Vector Registers hold vectors of tagged data, typically integers or floating-point data. (There are no speculative loads for the VRs and no associated valid bits. Vector operations with an invalid non-vector operand take an exception.) VRs are late pipeline state. |
The SR register file must support 3 read and 1 write port per instruction for floating-point multiply/add instructions at least. Since it does, other operations on SRs may take advantage of the third source operand.
71 | 64 | 63 | 58 | 57 | 47 | 46 | 38 | 37 | 34 | 33 | 29 | 28 | 13 | 12 | 11 | 10 | 9 | 0 | ||||||||
252 | hint | targr | targl | next | prev | start | s | c | offset | |||||||||||||||||
8 | 6 | 11 | 9 | 4 | 5 | 16 | 1 | 2 | 10 |
Field | Width | Bits | Description |
---|---|---|---|
offset | 10 | 9:0 | Instruction offset in bage for this BB |
c | 2 | 11:10 |
LOOPX present 0 ⇒ no LOOPX 1 ⇒ LOOPX present 2..3 ⇒ Reserved (possible use for nested loops) |
s | 1 | 12 |
Instruction size restriction: 0 ⇒ 16, 32, 48, and 64 bit instructions 1 ⇒ 32 and 64 bit instructions only |
start | 16 | 28:13 | Instruction start mask (interpretation depends on s field) |
prev | 5 | 33:29 | Mask of things targeting this BB for CFI checking |
next | 4 | 37:34 | BB type / exit method |
targl | 9 | 46:38 | Target BB offset in bage (virtual address bits 11:3) |
targr | 11 | 57:47 | Target BB bage relative to this bage (±1024 4 KiB bages) |
hint | 6 | 63:58 | Prediction hints specific to BB type |
252 | 8 | 71:64 | BB Descriptor Tag |
Basic block descriptors are words with tag 252 aligned on a word boundary. The basic block descriptor points to the instructions and gives the details of the control transfers to successor basic blocks.
The s and 16‑bit start fields specify both the size of the basic block and the location of all the instructions in it. If s is set, then all instructions are 32‑bit or 64‑bit; if clear then 16‑bit and 48‑bit instructions may also be present. For s = 0, each bit represents 16 bits at offset in the bage, and the BB size can be up to sixteen 16‑bit locations, which could contain eight 32‑bit instructions, sixteen 16‑bit instructions, or an intermediate number of a mixture of the two, or a lesser number if 48‑bit and 64‑bit instructions are included. For s = 1, each bit represents 32 bits and the BB size can be up to sixteen 32‑bit locations, which could contain sixteen 32‑bit instructions, eight 64‑bit instructions, or an intermediate number of a mixture of the two. If the block is larger than these limits, then it is continued using a fall-through next field. The 16‑bit start field gives a bit mask specifying which 2‑byte locations start instructions, which allows parallel instruction decoding to begin as soon as the instruction bytes are read from the instruction cache. For example, sixteen instruction decoders could be fed in a single cycle from a single 8‑word instruction cache line fetch, using the start mask to specify which bytes to decode. The start bit for the first 16 bits is implicitly 1 and is not stored. The last 1 bit in the start field represents the 2‑byte position after the last instruction. Thus, the number of instructions is the number of 1 bits in the start field (if 0 bits are set, then there are 0 instructions). If the last instruction ends before a 32‑bit boundary, the last 16 bits should be filled with an illegal instruction. The s = 1 case is intended for floating-point intensive basic blocks which tend to have few 16‑bit instructions and also tend to be longer.
To increase locality and keep pointers short, SecureRISC stores basic
block descriptors and instructions into 4 KiB regions of
the address space (called bages) with the basic block
descriptors in the one half and the instructions in the other half
(the compiler should alternate the half used for even and
odd bages to minimize set conflicts). This allows the pointer
from the descriptor to 32‑bit aligned instructions
to be only 10 bits, and in a paged system, the same TLB entry
maps both the descriptors and instructions (since bage size ≤
page size), so only the BB engine requires a TLB (its translations are
simply forwarded to the instruction fetch engine). The instructions
are fetched from
PC63..12 ∥ offset ∥ 02
in the L1 instruction cache in parallel with the BB engine moving to
fetch the next BB descriptor. For non-indirect branches and calls,
the target is given by an 11‑bit signed relative 4 KiB
delta from the current bage and a 9‑bit unsigned 8‑byte
aligned descriptor address within that bage. Specifically
TargetPC ← PC66..64 ∥ (PC63..12 +p (targr1041∥targr)) ∥ targl ∥ 03.
(Note: the name targr is short for
target relative and targl is
short for target low.)
For indirect branches and calls,
the targr/targl fields may be used
as a hint or default.
Instructions are stored in the bage with tag 240, which may be helpful when code reads and writes instructions in memory. A future option may be to use tags 240..243 to provide two more bits for instruction encoding per word, or one bit per 32‑bit location. Using 16 tags would provide four more bits per word, or one bit per 16‑bit location.
The low targl field is sufficient to index a set-associative BB descriptor cache that uses bits 11..3 (or a subset) as a set index without waiting for the targr addition giving the high bits. As an example, a 32 KiB, 8‑way set associate BB descriptor could read the tags in parallel with completing the addition giving the high address bits for tag comparison. If the minimum page size can be increased, then the number of bits allocated to the targl and offset fields might be increased and the bits to targr decreased; the current values were chosen for a minimum page size of 4 KiB, which encourages a bage size of 4 KiB to match. When targr = 0, the TLB translation for the current BB remains valid, and energy can be saved by detecting this case.
For even bages, it is recommended that BB descriptors start at the beginning of a bage, and instructions start on a 64‑byte boundary in the bage. Any full word padding between the last BB descriptor and the first instruction would use an illegal tag. For odd bages, BB descriptors would be packed at the end starting on a 64‑byte boundary and the instructions start at the beginning. Intermixing BB descriptors and instructions is possible but is not ideal for prefetch or cache utilization.
A non-zero c field (assembler %loopx) indicates that the BB contains a LOOPX/LOOPXI instruction, and therefore the BB engine should initialize its iteration count to zero and should predict the count until the AR engine executes the LOOPX and sends the actual loop count value back. If no prediction is available, 264−1 may be used. Often the AR engine does so before the final iteration, and the loop is predicted precisely even if this default loop count prediction is used. The iteration count increments when the next field contains a loop back or conditional loop back, and these are predicted as taken based on the iteration count being unequal to the predicted or actual loop count.
The next field specifies how the next basic block after this one is selected. It is sufficient to enable branch prediction, jump prediction, return address prediction, loop back prediction, etc. to occur without seeing the instructions involved in the basic block. Its values are described in Basic Block Descriptor Types in the subsequent section.
The prev field is used for Control Flow Integrity (CFI) checking and to implement the gates for calls to more privileged rings. It too is described in Basic Block Descriptor Types in the subsequent section.
The hint field will be defined in the future for prediction hints specific to each next field value. For example, conditional branches will use the hint field with a taken/not-taken initial value for prediction, a hysteresis bit (strong/weak), and an encoded 4‑bit history length (8, 11, 15, …, 1024) indication of what global history is most likely to be useful in prediction. Similarly indirect jumps and calls may have hints appropriate to their prediction. More hint bits would be nice to have, for example to encode Whisper’s Boolean function.
Note: I expect to use tag 253 for packing multiple Basic Block Descriptors in a single word. However, the details of this would probably be driven by statistics gathered once a compiler is generating the unpacked descriptors. This is expected to be limited to the BBDs that are internal to functions (simply branching).
The next field of the BB descriptor is used to specify how the successor to the current BB is determined. The values are given in the following table:
Value | Description | ||||||
---|---|---|---|---|---|---|---|
0 | Unconditional branch (%ubranch): The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above. There should be no branch or jump instructions in the basic block, as there is no prediction to check. | ||||||
1 | Conditional branch (%cbranch): The branch predictor is used to determine whether this branch is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. There should be exactly one conditional branch instruction, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8. | ||||||
2 | Call (%rcall): The address PC + 8 is written to the word pointed to CSP[TargetPC66..64], and CSP[TargetPC66..64] is incremented by 8. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above. There should be no branch or jump instructions in the basic block, as there is no prediction to check. | ||||||
3 | Conditional Call (%crcall): The branch predictor is used to determine whether this call is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. There is no instruction for the call itself in the basic block, as this is not predicted. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to CSP[TargetPC66..64], and CSP[TargetPC66..64] is incremented by 8. | ||||||
4 | Loop back (%loop): The predicted loop iteration count is used to predict whether this loop is taken or not, and this prediction is checked by the SOBX instruction in the instructions of the basic block. There should be exactly one SOBX, which may be located anywhere in the basic block instructions. There should be no other branch or jump instructions in the basic block. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8. | ||||||
5 | Conditional Loop back (%cloop): The branch predictor is used to determine whether this loop back is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. If the loop back is enabled by the branch, the predicted loop iteration count is used to determine whether this loop is taken or not, and this prediction is checked by the SOBX instruction in the instructions of the basic block. There should be exactly one SOBX, which may be located anywhere in the basic block instructions and exactly one conditional branch instruction, but no jump instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below or is the fall-through BB descriptor at PC + 8. | ||||||
6 |
Fall through (%fallthrough):
This Basic Block is unconditionally followed by the
BB at PC + 8. The targr/targl/start fields are not required for fall-through, so instead they may be used for prefetch. The targr/targl fields would then specify the first of several lines to prefetch into the BB Descriptor Cache (BBDC). The three least-significant bits of the targl field are not needed to specify a line in the BBDC, and are instead a sub-type:
|
||||||
7 | Reserved. | ||||||
8 |
Jump Indirect (%ijump):
The indirect jump predictor is used to predict the
destination BB descriptor address, and this prediction is checked
by the JMPA/LJMP/LJMPI/SWITCHX/etc.
instructions in the instructions of the basic block. There should
be exactly one jump, which may be located anywhere in the basic
block instructions, but no conditional branches. The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC. |
||||||
9 |
Conditional Jump Indirect (%cijump):
The branch predictor is used to
determine whether this jump indirect is taken or not, and this
prediction is checked by the branch decision is given by a branch
instruction in the instructions of the basic block. If the jump
indirect is enabled by the branch, the indirect jump predictor is
used to predict the destination BB descriptor address, and this
prediction is checked by
the JMPA/LJMP/LJMPI/SWITCHX/etc.
instruction in the instructions of the basic block. There should
be exactly one jump and exactly one conditional branch, which each
may be located anywhere in the basic block instructions. In the
case the jump is not taken the destination is fall-through BB
descriptor
at PC + 8. This type is expected to be used for case dispatch, where the conditional test checks whether the value is within range, and the JMPA/LJMP/LJMPI/SWITCHX uses PC ← PC + (XR[b] × 8) to choose one of several dispatch basic block descriptors, presuming that the BBs fit in the same 4 KiB bage (if not then a table and PC ← lvload72(AR[a] + XR[b]) should be used). The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC. |
||||||
10 |
Call Indirect (%icall):
The indirect jump predictor is used to predict the
destination BB descriptor address, and this prediction is checked
by the LJMP/LJMPI instruction in the
instructions of the basic block.
There should be exactly one LJMP/LJMPI,
which may be located anywhere in the basic block instructions, but
no conditional branch instructions.
The address PC + 8
is written to the word pointed
to CSP[TargetPC66..64],
and CSP[TargetPC66..64] is
incremented by 8. The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC. |
||||||
11 |
Conditional Call Indirect (%cicall):
The branch predictor is used to
determine whether this call indirect is taken or not, and this
prediction is checked by the branch decision is given by a branch
instruction in the instructions of the basic block. If the call
indirect is enabled by the branch, the indirect jump predictor is
used to predict the destination BB descriptor address, and this
prediction is checked by
the JMPA/LJMP/LJMPI/etc.
instruction in the
instructions of the basic block. There should be exactly one
jump, which may be located anywhere in the basic block
instructions, and exactly one conditional branch. In the case the
call is not taken the destination is fall-through BB descriptor
at PC + 8. In the case
where the call is taken, the
address PC + 8 is
written to the word pointed
to CSP[TargetPC66..64],
and CSP[TargetPC66..64] is
incremented by 8. The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC. |
||||||
12 |
Return (%return):
The Call Stack cache is used to predict the return
using CSP[PC66..64] − 8
as the index and CSP[PC66..64]
is decremented by 8. The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC. It may be desirable to encode Exception Return with this BB type. Using hint1 might be used to distinguish this case. |
||||||
13 | Conditional return (%creturn): This is probably only useful in leaf functions without a stack frame, unless register windows are added. | ||||||
14 | Reserved. | ||||||
15 | Reserved. |
The prev field of the BB descriptor is used to specify what methods are allowed to get to this BB for Control Flow Integrity (CFI) checking. It is a set of values bits, with the least significant bits of prev controlling interpretation of the more significant bits as follows:
Bit | Description | Assembler |
---|---|---|
1 | Fall through to this BB allowed | %pfallthrough |
2 | Branch/Loopback to this BB allowed | %pbranch |
3 | Jump to this BB allowed (for case dispatch) | %pswitch |
4 | Return to this BB allowed | %preturn |
Bit | Description | |
---|---|---|
2 | Call relative allowed | %prcall |
3 | Call indirect allowed | %picall |
4 | Gate allowed | %pgate |
Bits 4..3 | Description | |
---|---|---|
0 | Cross-ring Exception Entry | %pxrexc |
1 | Same-ring Exception Entry | %psrexc |
2 | Reserved | |
3 | Reserved |
Basic Block descriptors with one of the four call types (Call,
Conditional Call, Call Indirect, Conditional Call Indirect), push the
return address on a protected stack addressed by
the CSP indexed by the target ring
number (which is the same as the current ring number unless a gate is
addressed). Returns pop the address from the protected stack and jump
to it. The ring number of the CSP
pointer is used for the stores and loads, and typically this ring is not
writeable by the current ring.
The call semantics are as follows:
lvstore72(CSP[TargetPC66..64]) ← PC
CSP[TargetPC66..64]) ← CSP[TargetPC66..64]) +p 8
The return semantics are as follows:
PC ← lvload72(CSP[TargetPC66..64] −p 8)
CSP[TargetPC66..64]) ← CSP[TargetPC66..64]) −p 8
Overflow detection is important for implementing bignums in languages such as Lisp. SecureRISC provides a reasonably complete set of such instructions in addition to the usual mod 264 add, subtract, negate, multiply, and shift left.
Unsigned overflow could be detected by using the ADDC and SUBC instructions with BR[0] as the carry-in and BR[0] as the carry-out. But it might also make sense to have ADDOU (Add Overflow trapped Unsigned).
In addition, the ADDOS (Add Overflow Trapped Signed), ADDOUS (Add Overflow trapped Unsigned Signed), SUBOS (Subtract Overflow trapped Signed), SUBOU (Subtract Overflow trapped Unsigned), SUBOUS (Subtract Overflow trapped Unsigned Signed with Overflow), and NEGO (Negate Overflow trapped) instructions provide overflow checking for signed addition, subtraction, and negation, and signed-unsigned addition and subtraction. There is also SLLO (Shift Left Logical with Overflow) and SLAO (Shift Left Arithmetic with Overflow) in addition to the usual SLL. Finally there are MULUO, MULSO, and MULSUO for multiplication with overflow detection.
Overflow in the unsigned addition of load/store effective address generation is trapped. Segment bounds are also checked during effective address generation: the segment size is determined from the base register, and the effective address must agree with the base register for bits 63..size. A special small cache is required for this purpose, but the data portion is only eight bits of the Segment Descriptor Entry (a 6‑bit segment size and a 2‑bit generation).
SecureRISC has comparisons that produce both true and complement values (e.g. = and ≠ or < and ≥) so that they can be used with b0 as assertions. If b1 were hardwired to 1 and writes of 0 trapped, SecureRISC could have half as many comparisons, but would have to add more accumulation functions and SELI would have to have an inverted form. This would also require more compiler support to track whether Booleans in BRs are inverted or not. For the moment, SecureRISC has more comparisons, but might change.
SecureRISC provides floating-point comparisons that store 0 or 1 to a BR. These comparisons do not trap on NaN operands. The compiler can generate an unordered comparison to b0 to trap before doing the equal, less than, etc. test if traps on NaNs are required.
SecureRISC has trap instructions and Boolean Registers (BRs) primarily as a way to avoid conditional branching for computation. For example, to compute the min of x1 and x3 into x6, the RISC‑V ISA would use conditional branches:
move x6, x1 blt x1, x3, L move x6, x3 L:
The performance of the above on contemporary microarchitectures depends on the conditional branch prediction rate and the mispredict penalty, which in turn depends on how consistently x1 or x3 is the minimum value. In SecureRISC, the sequence could be as follows:
lts b2, s1, s3 sels s6, b2, s1, s3
This sequence involves no conditional branches and has consistent performance. (Note: there is actually a minss instruction that would be preferred here, but this illustrated a general point.)
As another example, the range test
assert ((lo <= x) && (x <= hi));
on RISC‑V would compile to
blt x, lo, T bge hi, x, L T: jal assertionfailed L:
but on SecureRISC would compile to
lts b1, x, lo orles b0, b1, hi, x
which involves no conditional branches, but instead using writes to b0 as a negative assertion check (trap if the value to be written is 1). The assembler would also accept torles b1, hi, x as equivalent to the above orles by supplying the b0 destination operand.
Even when conditional branches are used, the Boolean registers sometimes permit several tests to be combined before branching, so if we were branching on the range test above, instead of asserting it, the code might be
lts b1, x, lo borles b1, hi, x, outofrange
which has one branch rather than two.
Operations on tagged values trap if the tags are unexpected values. Integer addition requires that both tags be integers, or one tag be a pointer type and the other an integer. Integer subtraction requires the subtrahend tag to be an integer tag and the minuend to be either an integer or pointer tag. The resulting tag is integer with all integer sources, or pointer if one operand is a pointer. Integer bitwise logical operands and shifts require integer-tagged operands and produce an integer-tagged result. Floating point addition, subtraction, multiplication, division, and square root require floating-point tagged operands. To perform integer operands on floating-point tagged values (e.g. to extract the exponent) requires a CAST instruction to first change the tag. Similarly, to perform logical operations on a pointer, a CAST instruction to integer type is required.
Comparisons of tagged values compare the entire word in its entirety for =, ≠, <u, ≥u etc. This allows sorting regardless of type. Similarly, the CMPU operation produces −1, 0, 1 based on <u, =, >u of word values.
The ideal integer multiplication operation would be
SR[e],SR[d] ← (SR[a] ×u SR[b]) + SR[c] + SR[f]
to efficiently support multiword multiplication, but that requires 4
reads and 2 writes, which we clearly don’t want. The chosen
alternative is to introduce a
64‑bit CARRY register to provide the
additional 64‑bit input to the 128‑bit product and a place
to store the high 64 bits of the product as follows:
p ← SR[c] + (SR[a] ×u SR[b]) + CARRY
SR[d] ← p63..0
CARRY ← p127..64
The CARRY register is potentially awkward for
OoO microarchitectures. The simplest option is to rename it to a small
register file (e.g. 4 or 8‑entry) in the multiword arithmetic
unit. It is also possible that even an OoO processor will be called on
to have a subset of instructions that are to be executed in-order
relative to each other, and the multiword arithmetic instructions can be
put in this queue.
The ideal integer division operation would be
SR[e],SR[d] ← SR[c]∥SR[a] ÷u SR[b]
to efficiently support multiword division, but that requires 3 reads and
2 writes for quotient and remainder, which we clearly don’t want.
As with multiplication, the alternative is to use the proposed
64‑bit CARRY register to provide the
additional 64‑bit input to form the 128‑bit dividend and a
place to store the remainder. The remainder of the previous division
then naturally becomes the high bits of the current division. Thus the
definition of DIVC is:
q,r ← (CARRY∥SR[a]63..0) ÷u SR[b]63..0
CARRY ← r
SR[d] ← 240 ∥ q
Addition of polynomials over GF(2) is just xor (addition without
carries), and so the existing bitwise logical XORS instruction provides
this functionality. Polynomial multiplication requires carryless
multiplication instructions. Three forms are provided:
CARRY,SR[d] ← SR[a] ⊗ SR[b]
CARRY,SR[d] ← (SR[a] ⊗ SR[b]) ⊕ SR[c]
CARRY,SR[d] ← (SR[a] ⊗ SR[b]) ⊕ SR[c] ⊕ CARRY
A modulo reduction instruction may not be required, as illustrated by
the following example. In many
applications, the field uses a polynomial such
as 𝑥128+𝑥7+𝑥2+𝑥+1
and in this case a 256→128 reduction can be implemented by further
multiplication. First a series of carryless multiplication instructions
are used to form the 255‑bit
product p of two 128‑bit values.
Bits 254..128 of this product have
weight 𝑥128,
i.e. represent (p254𝑥126+…+p129𝑥+p128)𝑥128.
Because 𝑥128 mod 𝑥128+𝑥7+𝑥2+𝑥+1
is
just 𝑥7+𝑥2+𝑥+1,
multiplication of p254..128
by this value results in a product q
with a maximum term
of 𝑥133. q127..0
is added to p127..0
and q133..128 of that product
can then be multiplied again
by 𝑥7+𝑥2+𝑥+1
resulting in a product with a maximum term
of 𝑥12, which can
then be added to the low 128 bits of the original product (p127..0).
This generalizes to any modulo polynomial with no term
after 𝑥128 greater
than 𝑥63. If most
modulo reductions are of this form, then no specialized support is
required.
The ideal instructions for multiword addition and subtraction need
additional single bit inputs and outputs for the carry-in and
carry-out. The BRs would be natural
for this purpose, but this would result in undesirable five-operand
instructions, e.g. Add with Carry (ADDC)
would be:
s ← SR[a] +u SR[b] +u BR[c]
SR[d] ← s63..0
BR[e] ← s64.
To avoid five operand instructions, SecureRISC instead defines the Add
with Carry (ADDC) and Subtract with Carry
(SUBC) instructions to use one bit in the
64‑bit CARRY
register. ADDC is defined as:
s ← SR[a] +u SR[b] +u CARRY0
SR[d] ← s63..0
CARRY ← 063 ∥ s64.
SUBC is defined as:
s ← SR[a] −u SR[b] −u CARRY0
SR[d] ← s63..0
CARRY ← 063 ∥ s64.
One advantage of the 3 read SR file is
that shifts can be based upon a
funnel shift where the value to be shifted is the catenation
of SR[a]
and SR[b],
allowing for rotates by specifying the same operand for the high and low
funnel operands, and multiword shifts by supplying adjacent source words
of the multiword value. The basic operations are then
SR[d] ← (SR[b] ∥ SR[a]) >> imm6,
SR[d] ← (SR[b] ∥ SR[a]) >> (SR[c] mod 64), and
SR[d] ← (SR[b] ∥ SR[a]) >> (−SR[c] mod 64).
Conventional logical and arithmetic shifts are also provided. Left
shifts supply 0 for the lo side of the funnel and use a negative shift
amount. Logical right shifts supply 0 on the high side of the funnel
and arithmetic right shifts supply a signed-extended version
of SR[a] on the high side of the funnel.
Need to decide whether overflow detecting left shifts are required.
The CARRY register could be use as funnel shift operand instead of an SR, but that seems less flexible.
SecureRISC has a large number of registers that affect instruction execution. These registers, called CSRs, are accessed by special instructions that support reading, writing, swapping, setting bits, and clearing bits. Many ISAs have such instructions; the only unusual aspect of SecureRISC is the per-ring registers (RCSRs) and indirect CSRs (ICSRs). RCSRs can be accessed in two ways: first, via the CSR number in the immediate field and ring from an XR; and second via an encoding that refers to the register for the current ring of execution (PC.ring). ICSRs are accessed with a CSR base immediate, CSR index from an XR and an offset for the word of data at that index. For example, the ENCRYPTION ICSRs have five 64‑bit values for a given index (an 8‑bit algorithm and 256 bits of encryption key). Similarly the amatch ICSRs have five 64‑bit values for a given index (the address to match, 128 bits of region access permission, and 128 bits of region write permission).
Most CSRs, and all RCSRs and ICSRs read and write to and from the XRs. Late pipeline CSRs may be read and written to and from the SRs. It is TBD whether late pipeline CSRs may be read and written from the XRs, as such transfers are likely to many cycles, and so a case for providing them needs to be made.
Read and writing CSRs have no side effects. CSR operations always return the old value of the CSR, which if not useful, wastes a register, but that seems acceptable compared to providing separate opcodes to avoid the write.
In the specifications below, the definition of n ← op(o,v,m) comes from the opcode (mnemonic RW for Read-Write, mnemonic RS for Read-Set, mnemonic RC for Read-Clear). Here o is the old value, v is the operand value, and m is the per-CSR bitmask specifying which bits are writeable (some bits possibly being read-only).
Description | op | Definition |
---|---|---|
Read-Write | RW | n ← (o & ~m) | (v & m) |
Read-Set | RS | n ← o | (v & m) |
Read-Clear | RC | n ← o & ~(v & m) |
SecureRISC has two sorts of instructions for synchronization via memory
locations. The first is one of the primitives that can implement most
synchronization
methods: Compare And Swap).
Compare And Swap (CAS) exists for
the SRs
(CASS, CASSD, CASS64, CASS128,
CASSI, CASSDI, CASS64I, CASS128I).
and perhaps the XRs
(CASX, CASXD, CASX64, CASX128,
CASXI, CASXDI, CASX64I, CASX128I)
It is possible that 8, 16, and 32‑bit versions of Compare And Swap
might also be provided. It is also plausible that 288‑bit (half
cache block) and 576‑bit (whole cache block) CAS may be provided
from the VRs.
The basic schema of CAS is illustrated by the
following simplified semantics of CASS64,
with the other instruction formats being similar:
ea ← AR[a]
expected ← SR[b]
new ← SR[c]
m ← lvload64(ea)
if m = expected then
lvstore64(ea) ← new
endif
SR[d] ← m
This specification clearly violates the number of read and write ports
for the XRs,
and the CASX forms might be omitted, but
CAS instructions are likely at least two cycle instructions, and might
read the register file over two cycles. However, it is possible that a
CSR could be introduced for the expected value, though this would mean
longer instruction sequences for synchronization. TBD.
The second synchronization is not as powerful as Compare And Swap, and
could be implemented by CAS, but it may be more efficient in some
circumstances. It is atomic load and add
(AADDX64,
and AADDS64).
These instructions load the specified memory location into the
destination register and then atomically increment the memory location,
as illustrated by the following simplified semantics
of AADDX64:
m ← lvload64(ea)
t ← m + 1
lvstore64(ea) ← t
XR[d] ← m
These operations correspond to the ticket(S)
operation on a
sequencer, as defined
in Synchronization with Eventcounts and Sequencers
by Reed and Kanodia, though sequencers only require an atomic increment,
the generalization to AADDX64 etc. keeps
the system interconnect transaction for uncached atomic add similar to
atomic AND and OR below.
The third synchronization instructions are even less powerful than
atomic increment, and could be implemented by CAS, but may be more
efficient in some circumstances, such as the GC mark phase for updating
bitmaps. The instructions are atomic
AND (AANDX64), atomic
OR (AORX64), and atomic
XOR (AXORX64). These instructions load
the specified memory location into the destination register and then
atomically AND, OR, or XOR the memory location, as illustrated by the
following simplified semantics
of AANDX64:
m ← lvload64(ea)
t ← m & AR[b]
if t ≠ m then
lvstore64(ea) ← t
endif
XR[d] ← m
The case for RISC‑V’s
AMOSWAP, AMOMIN,
and AMOMAX
seem unclear at this point. (The case for
RISC‑V’s AMOXOR is also
unclear to this author, but it is trivial given support
for AANDX64
and AORX64, and also called for
by C++11 std::atomic
and so included.)
Some APIs (e.g. CUDA) may expect these operations, but they could be
implemented on SecureRISC with CAS instructions. C++20 added atomic
operations on floating-point types, but these are best done using CAS
(e.g. it is not appropriate to have floating-point addition in memory
controller for uncached operations).
Atomic operations may be performed by the processor on coherent memory locations in the cache by holding off coherency transactions during the operations involved, or on uncached locations by sending a transaction to the memory, which performs the operation atomically there and returns the result. The System Interconnect Address Attributes section describes main memory attributes indicating which locations implement uncached atomic memory operations. The locations to be modified by atomic operations must not cross a 64‑byte boundary; for example, the address for CASS64 must be in the range 0..56 mod 64.
SecureRISC will have the usual instructions to wait for an interrupt. Such instructions increase efficiency. While the details are TBD, for example, there might be a WAIT instruction that takes a value to write to InterruptEnable[PC.ring], and then suspends fetch before the next instruction (so that the return from the interrupt exception returns to that instruction). A more interesting instruction under consideration is one that waits for a memory location to change, which may be useful for reducing the overhead of memory based synchronization.
TBD
Note: SecureRISC has acquire and release options for loads and stores, which reduces (but does not eliminate) the need for some memory fences. Fences for virtual memory changes may be necessary, though it may be possible to handle those in the coherence protocol. Some fence instructions may also be useful in mitigating security vulnerabilities due to microarchitecture bugs.
SecureRISC has instructions for compiler-directed prefetching and to
control automatic prefetching. These instructions operate on
8‑word cache lines. The C prefix to
these assembler mnemonics represents Cache
. Rather than identify
caches as L1 BBDC, L1 Instruction, L2 Instruction, L1 Data, L2 Data, L3,
etc. we designate caches by referencing the instructions that use those
caches. Further work is required for things that operate on or stop at
some intermediate level of the hierarchy. These instructions operate on
cache block specified by an lvaddr and are subject to access
permissions. They are available to all rings. There will be privileged
instructions not yet listed here.
SecureRISC requires that writes invalidate or update all caches that contain previous fetches, including the BBDC and L1 and L2 Instruction caches. Previously fetched instructions still in the pipeline are not invalidated, so a fence is required. Thus, cache operations are not required for JIT compilers, merely the fence. This is typically implemented by having a directory at what ARM calls the Point of Unification (PoU) in the cache hierarchy. This directory records the locations in lower levels which may contain the a copy. Stores consult the directory and when other locations are noted, those locations are invalidated or updated. For multiprocessor systems, a first processor may write instructions that a second will execute. The first processor must execute a fence to ensure all writes have completed before signaling the second processor to proceed. The second processor must also use a fence to ensure that the pipeline has stale instructions (e.g. fetched speculatively). The details will be spelled out when the fence instructions are specified.
Is TLB prefetching required?
Instruction | Operation |
---|---|
Fetch prefetching and eviction designation (these may be executed too late in the pipeline to be useful an so may be replaced by BBD features) |
|
CPBB | Prefetch into Basic Block Descriptor Cache (BBDC) |
CPI | Prefetch into Basic Block Descriptor Cache (BBDC) and Instruction Cache |
CEBB | Designate eviction candidate for Basic Block Descriptor Cache (BBDC) |
CEI | Designate eviction candidate for Basic Block Descriptor Cache (BBDC) and Instruction Caches |
Early pipeline prefetching, zeroing, writeback, invalidation, and eviction designation | |
CPLA | Prefetch for LA/LAC/etc. |
CPLX |
Prefetch for LX/etc. (probably identical to CPLA in most cases) |
CPSA | Prefetch for SA/SAC/etc. |
CPSX |
Prefetch for SX/etc. (probably identical to CPSA in most cases) |
CZA | Zero cache block used for SA/SAC/etc. |
CZX |
Zero cache block used for SX/etc. (probably identical to CZA in most cases) |
CEA | Designate eviction candidate for LA/SA |
CEX | Designate eviction candidate for LX/SX |
CCX | Clean (writeback) for SX cache |
CCIX | Clean (writeback) and invalidate for SX cache |
Late pipeline prefetching, zeroing, writeback, invalidation, and eviction designation (the primary difference from early prefetching is some microarchitectures may not prefetch to the first stage(s) of the data cache hierarchy) |
|
CPLS | Prefetch for LS |
CPSS | Prefetch for SS |
CZS | Zero cache block used for SS/etc. |
CES | Designate eviction candidate for LS/SS |
CCS | Clean (writeback) for SS cache |
CCIS | Clean (writeback) and invalidate for SS cache |
The primary issue with fetch prefetching is that some implementations may execute explicit instructions too late to be useful. Eventually I expect to define new next codes in Basic Block Descriptors for L1 BBDC and L1 Instruction Cache prefetch and eviction designation to solve this problem. Whether some of the above instructions are removed by such a solution is TBD.
Prefetch may want additional options for rereference interval prediction and similar hints to avoid removing useful cache blocks when streaming data larger than the cache size.
It is likely appropriate to add some instructions that exist only for code size reduction, which expand into multiple SecureRISC instructions early in the pipeline (e.g. before register renaming). The best candidates for this so far are doubleword load/store instructions, which would expand into two singleword load/store instructions. This expansion and execution as separate instructions in the backend of the pipeline avoids the issues with register renaming that would otherwise exist. The partial execution of part of the pair would be allowed (and loads to the source registers would be not allowed). Doubleword load/store significantly reduce the size of function call entry and exit and may be useful for loading a code pointer and context pointer pair for indirect calls.
The following outlines some of the instructions without giving them their full definitions, which includes tag and bounds checking. The full definitions will follow later.
The 16‑bit instruction formats are included for code density. Some evaluation of whether it is worth the cost should be considered. Note that the BB descriptor gives the sizes of all instructions in the basic block in the form of the start bit mask, and so the instruction size is not encoded in the opcodes. The start mask allows multiple instructions to be decoded in parallel without parsing the instruction stream; in effect it provides an extra bit of information for every 16 bits of the instruction stream.
Because the identical or nearly identical instructions may exist in multiple, a convention for distinguishing them is required. Since 32‑bit instructions are most common, these have the shortest form. Mnemonics for instruction sizes other than 32 bits are indicated by their first letter:
Size | Mnemonic prefix |
---|---|
16 | 1 |
32 | |
48 | 3 |
64 | 4 |
Instructions that calculate an effective address are distinguished by the first letter of their mnemonic: Address, Load, or Store. For loads and stores, the second letter of the mnemonic gives the destination register file either A for ARs, X for XRs, S for SRs, M for VMs, or V for VRs. (There are no loads or stores to the BRs.) The next field of the mnemonic is empty for word loads and stores, or the size (8, 16, 32, or 64—possibly 128?) for sub-word loads and stores to the XRs or SRs. Word stores must be word-aligned, but 64‑bit (possibly 128? sub-word stores may be misaligned and generate an integer tag. Sub-word loads for 8, 16, or 32 bits next include S for signed or U for unsigned. Finally, the last letter is I for an immediate offset (as opposed to a XR[b] offset).
As examples of the above rules: A stores the address calculation AR[a] +p XR[b]<<sa to destination AR[d] while 1AI stores the address calculation AR[a] +p imm8. LA and LAI loads the contents of those two address calculation to the destination AR[d]. LX32U loads XR[d] from an unsigned 32‑bit memory location located using a XR[b] offset and LS16SI loads SR[d] from a signed 16‑bit memory location located using an immediate offset.
Arithmetic instructions use the operation (e.g. ADD or SUB) with a suffix X or S for the register file of the source and destination operands. If an immediate value is one of the operands, a final I is appended. For vector operations the suffixes are VV for vector-vector VS for vector-scalar, and VI for vector-scalar immediate.
As examples of the above rules:
Assembler | Simplified meaning (ignoring details) |
---|---|
ADDXI d, a, imm | XR[d] ← XR[a] + imm |
ADDS d, a, b | SR[d] ← SR[a] + SR[b] |
For Floating-Point operations, F is used for
IEEE754 binary32 (single-precision), D is
used for IEEE754 binary64
(double-precision), H is used for IEEE754
binary16 (half-precision), B is used for the
non-standard Machine Learning (ML) 16‑bit Brain Float
format,
and P3, P4,
and P5 are used for the three proposed IEEE
binary8pp formats for ML quarter-precision (8‑bit) with
5‑bit, 4‑bit, 3‑bit
exponents. Q is reserved for a future
IEEE754 binary128 (quad-precision).
Some floating-point examples are as follows:
Assembler | Simplified meaning (ignoring details) | Comment |
---|---|---|
FNMADDS d, a, b, c | SR[d] ← −(SR[a] ×f SR[b]) +f SR[c] | |
DMADDVS d, a, b, c | VR[d] ← (VR[a] ×d SR[b]) +d VR[c] | |
P4MBSUBVV d, c, a, b | VR[d] ← (VR[a] ×p4b VR[b]) −b VR[c] | P4 widening to BF multiply-subtract |
Mnemonic | Definition | Comment | Exp | Prec |
---|---|---|---|---|
Q | binary128 | quadruple-precision | 15 | 113 |
D | binary64 | double-precision | 11 | 53 |
F | binary32 | single-precision | 8 | 24 |
H | binary16 | half-precision | 5 | 11 |
B | bfloat16 | ML alternative to half-precision | 8 | 8 |
P5 | binary8p5 | quarter-precision for ML alternative | 3 | 5 |
P4 | binary8p4 | quarter-precision for ML alternative | 4 | 4 |
P3 | binary8p3 | quarter-precision for ML | 5 | 3 |
In the following sections, sometimes a set of instructions are defined with a mnemonic schema using the following:
What | Schema | Mnemonic | Definition | Comment |
---|---|---|---|---|
Operation Mnemonic schemas for ARs | ||||
Address Comparison |
ac | EQ | x63..0 = y63..0 | |
NE | x63..0 ≠ y63..0 | |||
LTU | x63..0 <u y63..0 | |||
GEU | x63..0 ≥u y63..0 | |||
TEQ | x71..64 = y7..0 | tag equal | ||
TNE | x71..64 ≠ y7..0 | tag not-equal | ||
TLTU | x71..64 <u y7..0 | tag less than | ||
TGEU | x71..64 ≥u y7..0 | tag greater than or equal | ||
WEQ | x71..0 = y71..0 | word equal | ||
WNE | x71..0 ≠ y71..0 | word not-equal | ||
WLTU | x71..0 <u y71..0 | word less than | ||
WGEU | x71..0 ≥u y71..0 | word greater than or equal | ||
Operation Mnemonic schemas for XRs | ||||
Index Arithmetic |
xa | ADD | x63..0 + y63..0 | mod 264 addition |
SUB | x63..0 − y63..0 | mod 264 subtraction | ||
MINU | minu(x63..0, y63..0) | |||
MINS | mins(x63..0, y63..0) | |||
MINUS | minus(x63..0, y63..0) | |||
MAXU | maxu(x63..0, y63..0) | |||
MAXS | maxs(x63..0, y63..0) | |||
MAXUS | maxus(x63..0, y63..0) | |||
Index Logical | xl | AND | x63..0 & y63..0 | |
OR | x63..0 | y63..0 | |||
XOR | x63..0 ^ y63..0 | |||
SLL | x63..0 <<u y5..0 | |||
SRL | x63..0 >>u y5..0 | |||
SRA | x63..0 >>s y5..0 | |||
Index Comparison |
xc | EQ | x63..0 = y63..0 | |
NE | x63..0 ≠ y63..0 | |||
LTU | x63..0 <u y63..0 | |||
LT | x63..0 <s y63..0 | |||
GEU | x63..0 ≥u y63..0 | |||
GE | x63..0 ≥s y63..0 | |||
NONE | (x63..0&y63..0)=0 | Check statistics | ||
ANY | (x63..0&y63..0)≠0 | Check statistics | ||
ALL | (x63..0&~y63..0)=0 | Check statistics | ||
NALL | (x63..0&~y63..0)≠0 | Check statistics | ||
BITC | xy5..0=0 | Check statistics | ||
BITS | xy5..0≠0 | Check statistics | ||
TEQ | x71..64 = y7..0 | tag equal | ||
TNE | x71..64 ≠ y7..0 | tag not-equal | ||
TLTU | x71..64 <u y7..0 | tag less than | ||
TGEU | x71..64 ≥u y7..0 | tag greater than or equal | ||
WEQ | x71..0 = y71..0 | word equal | ||
WNE | x71..0 ≠ y71..0 | word not-equal | ||
WLTU | x71..0 <u y71..0 | word less than | ||
WGEU | x71..0 ≥u y71..0 | word greater than or equal | ||
Operation Mnemonic schemas for SRs, BRs, VRs, and VMs | ||||
Boolean | bo | AND | x & y | |
ANDTC | x & ~y | |||
NAND | ~(x & y) | |||
NOR | ~(x | y) | |||
OR | x | y | |||
ORTC | x | ~y | |||
XOR | x ^ y | |||
EQV | x ^ ~y | |||
Boolean accumulation |
ba | AND | x & y | |
OR | x | y | OR with b0 used by assembler for non-accumulation | ||
Integer Comparison |
ic | EQ | x63..0 = y63..0 | |
NE | x63..0 ≠ y63..0 | |||
LTU | x63..0 <u y63..0 | |||
LT | x63..0 <s y63..0 | |||
GEU | x63..0 ≥u y63..0 | |||
GE | x63..0 ≥s y63..0 | |||
NONE | (x63..0&y63..0)=0 | Check statistics | ||
ANY | (x63..0&y63..0)≠0 | Check statistics | ||
ALL | (x63..0&~y63..0)=0 | Check statistics | ||
NALL | (x63..0&~y63..0)≠0 | Check statistics | ||
BITC | xy5..0=0 | Check statistics | ||
BITS | xy5..0≠0 | Check statistics | ||
Integer Arithmetic |
io | ADD | x63..0 + y63..0 | mod 264 addition |
SUB | x63..0 − y63..0 | mod 264 subtraction | ||
ADDOU | x63..0 +u y63..0 | Trap on unsigned overflow | ||
ADDOS | x63..0 +s y63..0 | Trap on signed overflow | ||
ADDOUS | x63..0 +us y63..0 | Trap on unsigned-signed overflow | ||
SUBOU | x63..0 −u y63..0 | Trap on unsigned overflow | ||
SUBOS | x63..0 −s y63..0 | Trap on signed overflow | ||
SUBOUS | x63..0 −us y63..0 | Trap on unsigned-signed overflow | ||
MINU | minu(x63..0, y63..0) | |||
MINS | mins(x63..0, y63..0) | |||
MINUS | minus(x63..0, y63..0) | |||
MAXU | maxu(x63..0, y63..0) | |||
MAXS | maxs(x63..0, y63..0) | |||
MAXUS | maxus(x63..0, y63..0) | |||
MUL | x63..0 × y63..0 | least-significant 64 bits of product | ||
MULOU | x63..0 ×u y63..0 | Trap on unsigned overflow | ||
MULOS | x63..0 ×s y63..0 | Trap on signed overflow | ||
MULUS | x63..0 ×us y63..0 | Trap on unsigned-signed overflow | ||
Integer 1-operand (should these be logical accumulations instead?) |
a1 | NEG | − x63..0 | negate |
ABS | abs(x63..0) | absolute value | ||
POPC | popcount(x63..0) | count number of one bits | ||
COUNTS | countsign(x63..0) | count most-significant bits equal to sign bit | ||
COUNTMS0 | countms0(x63..0) | |||
COUNTMS1 | countms1(x63..0) | |||
COUNTLS0 | countls0(x63..0) | |||
COUNTLS1 | countls1(x63..0) | |||
Integer Arithmetic accumulation |
ia | ADD | x63..0 + y63..0 | |
SUB | x63..0 − y63..0 | |||
y63..0 |
Non-accumulation Mnemonic omitted in assembler: e.g. just ADDS d, a, b is encoded with this ia to perform SR[d] ← SR[a] + SR[b] |
|||
Bitwise Logical | lo | AND | x63..0 & y63..0 | |
ANDTC | x63..0 & ~y63..0 | |||
NAND | ~(x63..0 & y63..0) | |||
NOR | ~(x63..0 | y63..0) | |||
OR | x63..0 | y63..0 | |||
ORTC | x63..0 | ~y63..0 | |||
XOR | x63..0 ^ y63..0 | |||
EQV | x63..0 ^ ~y63..0 | |||
SLL | x63..0 <<u y5..0 | |||
SRL | x63..0 >>u y5..0 | |||
SRA | x63..0 >>s y5..0 | |||
CLMUL | x63..0 ⊗ y63..0 | Carryless multiplication | ||
Bitwise Logical accumulation |
la | AND | x63..0 & y63..0 | |
OR | x63..0 | y63..0 | |||
XOR | x63..0 ^ y63..0 | Primarily for CLMUL | ||
y63..0 |
Non-accumulation Mnemonic omitted in assembler: e.g. just ANDS d, [c,] a, b is encoded with this la to perform SR[d] ← SR[a] & SR[b] with SR[c] ignored. |
|||
Floating-Point Arithmetic |
fo | ADD | x +fmt y | |
SUB | x −fmt y | |||
MIN | minfmt(x, y) | |||
MAX | maxfmt(x, y) | |||
M | x ×fmt y | |||
NM | −(x ×fmt y) | negative multiply | ||
Mw | w(x) ×w w(y) | widening multiply | ||
NMw | −(w(x) ×w w(y)) | widening negative multiply | ||
DIV | x63..0 ÷fmt y63..0 | Must be no-accumulation | ||
Floating-Point accumulation |
fa | ADD | x +fmt y | |
SUB | x −fmt y | |||
y63..0 |
Non-accumulation Mnemonic omitted in assembler: e.g. just DADDS d, a, b is encoded with this fa to perform SR[d] ← SR[a] +d SR[b] |
|||
Floating-Point 1-operand |
f1 | MOV | x | |
NEG | −fmt x63..0 | |||
ABS | absfmt(x63..0) | |||
RECIP | 1.0 ÷fmt x63..0 | |||
SQRT | sqrtfmt(x63..0) | |||
RSQRT | rsqrtfmt(x63..0) | |||
FLOOR | floorfmt(x63..0) | |||
CEIL | ceilfmt(x63..0) | |||
TRUNC | truncfmt(x63..0) | |||
ROUND | roundfmt(x63..0) | |||
CVTI | converti,fmt(x63..0) | |||
CVTB | convertb,fmt(x63..0) | |||
CVTH | converth,fmt(x63..0) | |||
CVTF | convertf,fmt(x63..0) | |||
CVTD | convertd,fmt(x63..0) | |||
FLOATU | floatfmt,u(x63..0, imm) | |||
FLOATS | floatfmt,s(x63..0, imm) | |||
CLASS | classfmt(x63..0) | |||
Floating-Point Comparison |
fc | OR | x63..0 ?fmt y63..0 | ordered |
UN | x63..0 ~?fmt y63..0 | unordered | ||
EQ | x63..0 =fmt y63..0 | |||
NE | x63..0 ≠fmt y63..0 | |||
LT | x63..0 <fmt y63..0 | |||
GE | x63..0 ≥fmt y63..0 | |||
LE | x63..0 ≤fmt y63..0 | |||
GT | x63..0 >fmt y63..0 |
Class | Schema | Definition | Examples |
---|---|---|---|
Integer Arithmetic | ioia | SR[d] ← SR[c] ia (SR[a] io SR[b]) | MADDS |
Bitwise Logical | lola | SR[d] ← SR[c] la (SR[a] lo SR[b]) | ANDORS |
Floating-Point | fofa | SR[d] ← SR[c] fafmt (SR[a] fofmt SR[b]) | DNMSUBS |
Boolean | boba | BR[d] ← BR[c] ba (BR[a] bo BR[b]) | ANDORS |
VM[d] ← VM[c] ba (VM[a] bo VM[b]) | ORANDM | ||
Integer Comparison | icba | BR[d] ← BR[c] ba (SR[a] ic SR[b]) | LTUANDS |
Floating-Point Comparison | fcba | BR[d] ← BR[c] ba (SR[a] fc SR[b]) | DLTANDS |
Value | Mnemonic | Function | Mnemonic | Function |
---|---|---|---|---|
0000 | F | 0 | ||
0001 | NOR | a ~| b | ANDCC | ~a & ~b |
0010 | ANDTC | a & ~b | ||
0011 | NOTB | ~b | ||
0100 | ANDCT | ~a & b | ||
0101 | NOTA | ~a | ||
0110 | XOR | a ^ b | ||
0111 | NAND | a ~& b | ORCC | ~a | ~b |
1000 | AND | a & b | ||
1001 | EQV | a ~^ b | XNOR | ~(a ^ b) |
1010 | A | a | ||
1011 | ORTC | a | ~b | ||
1100 | B | b | ||
1101 | ORCT | ~a | b | ||
1110 | OR | a | b | ||
1111 | T | 1 |
m | What | Example mnemonic |
---|---|---|
0 | Reserved | |
1 | Unsigned | MINU |
2 | Signed | MAXS |
3 | Unsigned Signed | MINUS |
m | What | Example mnemonic |
---|---|---|
0 | wrap | ADD |
1 | Overflow Unsigned | SUBOU |
2 | Overflow Signed | MULOS |
3 | Overflow Unsigned Signed | ADDOUS |
field TBD |
Static | Dynamic |
---|---|---|
0 | Nearest, ties to Even | |
1 | Round to odd | |
2 | Nearest, ties to Min Magnitude | |
3 | Nearest, ties to Max Magnitude | |
4 | Toward −∞ (floor) | |
5 | Toward +∞ (ceiling) | |
6 | Toward 0 (truncate) | |
7 | Dynamic | Away from 0 |
field TBD | Data width |
Aligned | MemTag check |
Examples |
---|---|---|---|---|
0 | 8 | 240..245 | LX8U, LS8SI | |
1 | 16 | 240..245 | LS16S, LS16UI | |
2 | 32 | 240..245 | SX32I, SS32 | |
3 | 64 | 240..252 | LX64UI, SS64 | |
4 | 72 | word | LAI, LX, LS, SSI | |
5 | 144 | doubleword | LAD, SADI | |
6 | 144 | doubleword | 232/251 | LAC, SAC |
7 | 64 | clique | CLA64, CSA64 |
field TBD | Mnemonic | Semantics |
---|---|---|
0 | Neither acquire nor release | |
1 | .a | Acquire |
2 | .r | Release |
3 | .ar | Acquire and Release |
The table below lists the indexed load/store opcode mnemonics, but the
same encoding is used for the immediate offset opcodes (i.e. with the
appended I suffix). The {L,S}{X,S}128
instructions marked with a ?
are possible placeholders for future
code density instructions that expand into a pair of load or store
instructions, similar to the existing {L,S}{X,S}D instructions.
field TBD |
Reg file |
Operation | field TBD | |||||||
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |||
0 | XR | Load Unsigned | LX8U | LX16U | LX32U | LX64U | LX | LXD | LX128? | |
1 | SR | Load Unsigned | LS8U | LS16U | LS32U | LS64U | LS | LSD | LS128? | |
2 | XR | Speculative Load Unsigned | SLX8U | SLX16U | SLX32U | SLX64U | SLX | SLXD | ||
3 | SR | Speculative Load Unsigned | SLS8U | SLS16U | SLS32U | SLS64U | SLS | SLSD | ||
4 | XR | Load Signed | LX8S | LX16S | LX32S | LX64S | ||||
5 | SR | Load Signed | LS8S | LS16S | LS32S | LS64S | ||||
6 | XR | Speculative Load Signed | SLX8S | SLX16S | SLX32S | SLX64S | ||||
7 | SR | Speculative Load Signed | SLS8S | SLS16S | SLS32S | SLS64S | ||||
8 | AR | Load | RLA32 | RLA64 | LA | LAD | LAC | CLA64 | ||
9 | VM | Load | LM | |||||||
10 | AR | Speculative Load | SRLA32 | SRLA64 | SLA | SLAD | SLAC | SCLA64 | ||
11 | Reserved | |||||||||
12 | XR | Store | SX8 | SX16 | SX32 | SX64 | SX | SXD | SX128? | |
13 | SR | Store | SS8 | SS16 | SS32 | SS64 | SS | SSD | SS128? | |
14 | AR | Store | RSA32 | RSA64 | SA | SAD | SAC | CSA64 | ||
15 | VM | Store | SM |
n | Suffix | What | Example | m usage | f usage |
---|---|---|---|---|---|
0 | S | Scalar integer |
SR[d] ← SR[c] ia (SR[a] io SR[b]) | su or osu | |
B | Boolean | BR[d] ← BR[c] ba (BR[a] bo BR[b]) | |||
S | Scalar floating |
SR[d] ← SR[c] fafmt (SR[a] fofmt SR[b]) | round | ||
1 | SV | Vector reduction to scalar integer |
SR[d] ← reduction(VR[a]) | ||
SV | Vector reduction to scalar floating |
SR[d] ← reductionfmt(VR[a]) | round | ||
M | Mask | VM[d] ← VM[c] ba (VM[a] bo VM[b]) | |||
2 | VS | Vector Scalar integer |
VR[d] ← VR[c] ia (VR[a] io SR[b]) masked by VM[e] |
vector mask |
|
VS | Vector Scalar floating |
VR[d] ← VR[c] fafmt (VR[a] fofmt SR[b]) masked by VM[e] |
vector mask |
round | |
VI | Vector Immediate integer |
VR[d] ← VR[a] io imm masked by VM[e] |
vector mask |
||
VS | Vector Scalar integer compare |
VM[d] ← VM[c] ba (VR[a] ic SR[b]) | |||
VI | Vector Immediate integer compare |
VM[d] ← VR[a] ic imm | |||
VS | Vector Scalar floating compare |
VM[d] ← VM[c] ba (VR[a] fcfmt SR[b]) | |||
3 | VV | Vector Vector integer |
VR[d] ← VR[c] ia (VR[a] io VR[b]) masked by VM[e] |
vector mask |
|
VV | Vector Vector floating |
VR[d] ← VR[c] fafmt (VR[a] fofmt VR[b]) masked by VM[e] |
vector mask |
round | |
VV | Vector Vector integer compare |
VM[d] ← VM[c] ba (VR[a] ic VR[b]) | |||
VV | Vector Vector floating compare |
VM[d] ← VM[c] ba (VR[a] fcfmt VR[b]) |
The following are a sketch of the 16‑bit instruction encodings, but the actual encodings will be determined by analyzing instruction frequency in the 32‑bit instruction set.
1:0 3:2 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1A | 1LA | 1SA | i16da |
1 | 1AI | 1LAI | 1SAI | i16ab0 |
2 | 1ADDX | 1LX | 1SX | 1XI |
3 | 1ADDXI | 1LXI | 1SXI | i16ab1 |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
b | a | d | op16 | ||||||||
4 | 4 | 4 | 4 |
Word address calculation with indexed addressing: 1A | ||
1A | d, a, b |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<3) = 0 AR[d] ← AR[a] +p XR[b]<<3 AV[d] ← v |
Index register addition | ||
1ADDX | d, a, b |
XR[d] ← XR[a] + XR[b] XV[d] ← XV[a] & XV[b] |
Non-speculative tagged word loads with indexed addressing: L{A,X,S} (LS in 32‑bit table) |
||
1LA | d, a, b |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<3) = 0 AR[d] ← v ? lvload72(AR[a] +p XR[b]<<3) : 0 AV[d] ← v |
1LX | d, a, b |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<3) = 0 XR[d] ← v ? lvload72(AR[a] +p XR[b]<<3) : 0 XV[d] ← v |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
imm4 | a | d | op16 | ||||||||
4 | 4 | 4 | 4 |
Word address calculation with immediate addressing: AI | ||
1AI | d, a, imm4 |
v ← AV[a] trap if v & boundscheck(AR[a], imm4<<3) = 0 AR[d] ← AR[a] +p imm4<<3 AV[d] ← v |
Index register addition immediate | ||
1ADDXI | d, a, imm4 |
XR[d] ← XR[a] + imm4 XV[d] ← XV[a] |
Non-speculative tagged word loads with indexed addressing: L{A,X,S}I (LSI and wider immediate LA and LX in 32‑bit table) |
||
1LAI | d, a, imm4 |
v ← AV[a] trap if v & boundscheck(AR[a], imm4<<3) = 0 AR[d] ← v ? lvload72(AR[a] +p imm4<<3) : 0 AV[d] ← v |
1LXI | d, a, imm4 |
v ← AV[a] trap if v & boundscheck(AR[a], imm4<<3) = 0 XR[d] ← v ? lvload72(AR[a] +p imm4<<3) : 0 XV[d] ← v |
13:12 15:14 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1NEGX | 1NOTX | 1MOVSX | 1MOVXS |
1 | 1RTAGX | 1RTAGA | 1RSIZEA | |
2 | 1MOVAX | 1MOVXA | 1MOVAS | 1MOVSA |
3 | 1SOBX | 1RTAGS |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
op16da | a | d | i16da | ||||||||
4 | 4 | 4 | 4 |
1NEGX | d, a |
XR[d] ← −XR[a] XV[d] ← XV[a] |
1NOTX | d, a |
XR[d] ← ~XR[a] XV[d] ← AV[a] |
1MOVAX | d, a |
AR[d] ← XR[a] AV[d] ← XV[a] |
1MOVXA | d, a |
XR[d] ← AR[a] XV[d] ← AV[a] |
1MOVSX | d, a |
SR[d] ← XR[a] SV[d] ← XV[a] |
1MOVXS | d, a |
XR[d] ← SR[a] XV[d] ← SV[a] |
1MOVAS | d, a |
AR[d] ← SR[a] AV[d] ← SV[a] |
1MOVSA | d, a |
SR[d] ← AR[a] SV[d] ← AV[a] |
1RTAGA | d, a |
XR[d] ← 240 ∥ 056 ∥ AR[a]71..64 XV[d] ← XV[a] |
1RTAGX | d, a |
XR[d] ← 240 ∥ 056 ∥ XR[a]71..64 XV[d] ← XV[a] |
1RTAGS | d, a |
SR[d] ← 240 ∥ 056 ∥ SR[a]71..64 SV[d] ← SV[a] |
1RSIZEA | d, a |
XR[d] ← 240 ∥ 03 ∥ AR[a]132..72 XV[d] ← XV[a] |
1SOBX |
trap if XV[a] = 0 XR[d] ← XR[a] − 1 XV[d] ← 1 loop back if XR[d] ≠ 0 |
1NEGX is identical to RSUBXI with an immediate of 0 but is 16 bits rather than 32. Whether this is important is unclear. 1NOTX is identical to RSUBXI with an immediate of -1 but is 16 bits rather than 32. Whether this is important is unclear. Whether to include these will depend on code size statistics.
5:3 7:6 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1BEQA | 1BNEA | 1BLTAU | 1BGEAU |
1 | 1BEQX | 1BNEX | 1BLTXU | 1BGEXU |
2 | 1BNONEX | 1BANYX | 1BLTX | 1BGEX |
3 | i16a0 |
5:4 7:6 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1TEQA | 1TNEA | 1TLTAU | 1TGEAU |
1 | 1TEQX | 1TNEX | 1TLTXU | 1TGEXU |
2 | 1TNONEX | 1TANYX | 1TLTX | 1TGEX |
3 | i16a1 |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
b | a | op16ab | i16ab | ||||||||
4 | 4 | 4 | 4 |
All of the following first do either
trap if AV[a] & AV[b] = 0
or
trap if XV[a] & XV[b] = 0
as appropriate.
1BEQA | a, b | branch if AR[a] = AR[b] |
1BEQX | a, b | branch if XR[a] = XR[b] |
1BNEA | a, b | branch if AR[a] ≠ AR[b] |
1BNEX | a, b | branch if XR[a] ≠ XR[b] |
1BLTAU | a, b | branch if AR[a] <u AR[b] |
1BLTXU | a, b | branch if XR[a] <u XR[b] |
1BGEAU | a, b | branch if AR[a] ≥u AR[b] |
1BGEXU | a, b | branch if XR[a] ≥u XR[b] |
1BLTX | a, b | branch if XR[a] <s XR[b] |
1BGEX | a, b | branch if XR[a] ≥s XR[b] |
1BNONEX | a, b | branch if (XR[a] & XR[b]) = 0 |
1BANYX | a, b | branch if (XR[a] & XR[b]) ≠ 0 |
1TEQA | a, b | trap if AR[a] = AR[b] |
1TEQX | a, b | trap if XR[a] = XR[b] |
1TNEA | a, b | trap if AR[a] ≠ AR[b] |
1TNEX | a, b | trap if XR[a] ≠ XR[b] |
1TLTAU | a, b | trap if AR[a] <u AR[b] |
1TLTXU | a, b | trap if XR[a] <u XR[b] |
1TGEAU | a, b | trap if AR[a] ≥u AR[b] |
1TGEXU | a, b | trap if XR[a] ≥u XR[b] |
1TLTX | a, b | trap if XR[a] <s XR[b] |
1TGEX | a, b | trap if XR[a] ≥s XR[b] |
1TNONEX | a, b | trap if (XR[a] & XR[b]) = 0 |
1TANYX | a, b | trap if (XR[a] & XR[b]) ≠ 0 |
13:12 15:14 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1BEQNA | 1BNENA | 1BF | 1BT |
1 | 1BEQZX | 1BNEZX | 1BLTZX | 1BGEZX |
2 | 1JMPA | |||
3 | 1SWITCHX | 1BLEZX | 1BGTZX |
13:12 15:14 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1TEQNA | 1TNENA | 1TF | 1TT |
1 | 1TEQZX | 1TNEZX | 1TLTZX | 1TGEZX |
2 | 1CHKVA | 1CHKVX | 1CHKVS | |
3 | 1TLEZX | 1TGTZX |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
op16a | a | i16a | i16ab | ||||||||
4 | 4 | 4 | 4 |
All of the following first do either
trap if AV[a] = 0
or
trap if XV[a] = 0
or
trap if BV[a] = 0
as appropriate.
1BEQNA | a | branch if AR[a]63..0 = 0 |
1BNENA | a | branch if AR[a]63..0 ≠ 0 |
1BEQZX | a | branch if XR[a]63..0 = 0 |
1BNEZX | a | branch if XR[a]63..0 ≠ 0 |
1BLTZX | a | branch if XR[a]63..0 <s 0 |
1BGEZX | a | branch if XR[a]63..0 ≥s 0 |
1BLEZX | a | branch if XR[a]63..0 ≤s 0 |
1BGTZX | a | branch if XR[a]63..0 >s 0 |
1BF | a | branch if BR[a] = 0 |
1BT | a | branch if BR[a] ≠ 0 |
1TEQZX | a | trap if XR[a]63..0 = 0 |
1TNEZX | a | trap if XR[a]63..0 ≠ 0 |
1TLTZX | a | trap if XR[a]63..0 <s 0 |
1TGEZX | a | trap if XR[a]63..0 ≥s 0 |
1TLEZX | a | trap if XR[a]63..0 ≤s 0 |
1TGTZX | a | trap if XR[a]63..0 >s 0 |
1TF | a | trap if BR[a] = 0 |
1TT | a | trap if BR[a] ≠ 0 |
1JMPA | a |
trap if AR[a]71..68 ≠ 12 trap if AR[a]2..0 ≠ 0 trap if PC66..64 ≠ AR[a]66..64 PC ← AR[a]66..0 |
1SWITCHX | a |
trap if XR[a]71..65 ≠ 120 PC ← PC +p (XR[a]<<3) |
1CHKVA | a | trap if AV[a] = 0 |
1CHKVX | a | trap if XV[a] = 0 |
1CHKVS | a | trap if SV[a] = 0 |
15 | 8 | 7 | 4 | 3 | 0 | |||
imm8 | d | op16 | ||||||
8 | 4 | 4 |
1XI | d, imm8 |
XR[d] ← 240 ∥ imm8748 ∥ imm8 XV[d] ← 1 |
1:0 3:2 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | AXload | AXstore | SVload | SVstore |
1 | ||||
2 | ARop | XRop | SRop | VRop |
3 | XI | XUI | SI | SUI |
31 | 28 | 27 | 24 | 23 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | f | n | m | c | b | a | d | op32 | ||||||||||||||||
4 | 4 | 2 | 2 | 4 | 4 | 4 | 4 | 4 |
Scalar Integer | ||
ioiaS | d, c, a, b |
SR[d] ← SR[c] ia (SR[a] io SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
lolaS | d, c, a, b |
SR[d] ← SR[c] la (SR[a] lo SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
SELS | d, c, a, b |
SR[d] ← BR[c] ? SR[a] : SR[b] SV[d] ← BV[c] & (BR[c] ? SV[a] : SV[b]) |
i1S | d, a |
SR[d] ← i1(SR[a]) SV[d] ← SV[a] |
Scalar Integer Multiword | ||
FUNS | d, b, a, c |
t ← (SR[b]63..0∥SR[a]63..0) >> SR[c]5..0 SR[d] ← 240 ∥ t63..0 SV[d] ← SV[a] & SV[b] & SV[c] |
ROTRS | d, a, b | assembler expands to FUNS d, a, a, b |
FUNNS | d, b, a, c |
t ← (SR[b]63..0∥SR[a]63..0) >> (−SR[c])5..0 SR[d] ← 240 ∥ t63..0 SV[d] ← SV[a] & SV[b] & SV[c] |
ROTLS | d, a, b | assembler expands to FUNNS d, a, a, b |
ADDC | d, b, a |
trap if (SV[a] & SV[b]) = 0 t ← SR[a]63..0 + SR[b]63..0 + CARRY0 CARRY ← 063 ∥ t64 SR[d] ← 240 ∥ t63..0 SV[d] ← 1 |
MULC | d, b, a, c |
trap if (SV[a] & SV[b] & SV[c]) = 0 t ← (SR[a]63..0 ×u SR[b]63..0) + SR[c]63..0 + CARRY CARRY ← t127..64 SR[d] ← 240 ∥ t63..0 SV[d] ← 1 |
DIVC | d, b, a, c |
trap if (SV[a] & SV[b]) = 0 q,r ← (CARRY∥SR[a]63..0) ÷u SR[b]63..0 CARRY ← r SR[d] ← 240 ∥ q SV[d] ← 1 |
Boolean | ||
boba | d, c, a, b |
BR[d] ← BR[c] ba (BR[a] bo BR[b]) BV[d] ← BV[a] & BV[b] & BV[c] |
bobaM | d, c, a, b | VR[d] ← VM[c] ba (VM[a] bo VM[b]) |
Integer Comparison | ||
acbaA | d, c, a, b |
BR[d] ← BR[c] ba (AR[a] ac AR[b]) BV[d] ← AV[a] & AV[b] & BV[c] |
xcbaX | d, c, a, b |
BR[d] ← BR[c] xa (XR[a] xc XR[b]) BV[d] ← XV[a] & XV[b] & BV[c] |
icbaS | d, c, a, b |
BR[d] ← BR[c] ba (SR[a] ic SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
Floating-Point | ||
Df1S | d, a |
SR[d] ← f1d(SR[a]) SV[d] ← SV[a] |
Ff1S | d, a |
SR[d] ← f1f(SR[a]) SV[d] ← SV[a] |
Hf1S | d, a |
SR[d] ← f1h(SR[a]) SV[d] ← SV[a] |
Bf1S | d, a |
SR[d] ← f1b(SR[a]) SV[d] ← SV[a] |
P4f1S | d, a |
SR[d] ← f1p4(SR[a]) SV[d] ← SV[a] |
P3f1S | d, a |
SR[d] ← f1p3(SR[a]) SV[d] ← SV[a] |
DfofaS | d, c, a, b |
SR[d] ← SR[c] fad (SR[a] fod SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
FfofaS | d, c, a, b |
SR[d] ← SR[c] faf (SR[a] fof SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
HfofaS | d, c, a, b |
SR[d] ← SR[c] fah (SR[a] foh SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
BfofaS | d, c, a, b |
SR[d] ← SR[c] fab (SR[a] fob SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
P4fofaS | d, c, a, b |
SR[d] ← SR[c] fap4 (SR[a] fop4 SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
P3fofaS | d, c, a, b |
SR[d] ← SR[c] fap3 (SR[a] fop3 SR[b]) SV[d] ← SV[a] & SV[b] & SV[c] |
Boolean Floating-Point Comparison | ||
DfcbaS | d, c, a, b |
BR[d] ← BR[c] bad (SR[a] fcd SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
FfcbaS | d, c, a, b |
BR[d] ← BR[c] baf (SR[a] fcf SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
HfcbaS | d, c, a, b |
BR[d] ← BR[c] bah (SR[a] fch SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
BfcbaS | d, c, a, b |
BR[d] ← BR[c] bab (SR[a] fcb SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
P4fcbaS | d, c, a, b |
BR[d] ← BR[c] bap4 (SR[a] fcp4 SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
P3fcbaS | d, c, a, b |
BR[d] ← BR[c] bap3 (SR[a] fcp3 SR[b]) BV[d] ← SV[a] & SV[b] & BV[c] |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | i | c | i | a | d | op32 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
Index comparison immediate | ||
xcbaXI | d, c, a, imm |
BR[d] ← BR[c] ba (XR[a] xc imm12) BV[d] ← XV[a] & BV[c] |
Scalar comparison immediate | ||
icbaSI | d, c, a, imm |
BR[d] ← BR[c] ba (SR[a] ic imm12) BV[d] ← SV[a] & BV[c] |
Scalar arithmetic immediate | ||
ioiaSI | d, c, a, imm |
SR[d] ← SR[c] ia (SR[a] io imm12) SV[d] ← SV[a] & SV[c] |
lolaSI | d, c, a, imm |
SR[d] ← SR[c] la (SR[a] lo imm12) SV[d] ← SV[a] & SV[c] |
SELSI | d, c, a, imm |
SR[d] ← BR[c] ? SR[a] : imm12 SV[d] ← BV[c] & (~BR[c] | SV[a]) |
31 | 28 | 27 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | op32f | m | op32c | b | a | d | op32 | |||||||||||||||
4 | 6 | 2 | 4 | 4 | 4 | 4 | 4 |
Address arithmetic: SUBXAA | ||
SUBXAA | d, a, b |
XR[d] ← 240 ∥ (AR[a]63..0 − AR[b]63..0) XV[d] ← AV[a] & AV[b] |
Index arithmetic | ||
ADDX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 + XR[b]63..0) XV[d] ← XV[a] & XV[b] |
SUBX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 − XR[b]63..0) XV[d] ← XV[a] & XV[b] |
MINUX | d, a, b |
XR[d] ← 240 ∥ minu(XR[a]63..0, XR[b]63..0) XV[d] ← XV[a] & XV[b] |
MINSX | d, a, b |
XR[d] ← 240 ∥ mins(XR[a]63..0, XR[b]63..0) XV[d] ← XV[a] & XV[b] |
MAXUX | d, a, b |
XR[d] ← 240 ∥ maxu(XR[a]63..0, XR[b]63..0) XV[d] ← XV[a] & XV[b] |
MAXSX | d, a, b |
XR[d] ← 240 ∥ maxs(XR[a]63..0, XR[b]63..0) XV[d] ← XV[a] & XV[b] |
Possible changes | ||
ADDX | d, a, b, sa |
XR[d] ← 240 ∥ (XR[a]63..0 + XR[b]63..0<<sa) XV[d] ← XV[a] & XV[b] |
SUBX | d, a, b, sa |
XR[d] ← 240 ∥ (XR[a]63..0 − XR[b]63..0<<sa) XV[d] ← XV[a] & XV[b] |
Instructions for loop iteration count prediction | ||
LOOPX | d |
trap if XV[a] & XV[b] = 0 XR[d] ← XR[a] − XR[b] XV[d] ← 1 |
Possible additions: ADDOUX, ADDOSX, ADDUSX, SUBOUX, SUBOSX, SUBUSX, MINOUSX, MAXOUSX | ||
Index logical | ||
ANDX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 & XR[b]63..0) XV[d] ← XV[a] & XV[b] |
ORX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 | XR[b]63..0) XV[d] ← XV[a] & XV[b] |
XORX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 ^ XR[b]63..0) XV[d] ← XV[a] & XV[b] |
SLLX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 <<u XR[b]5..0) XV[d] ← XV[a] & XV[b] |
SRLX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 >>u XR[b]5..0) XV[d] ← XV[a] & XV[b] |
SRAX | d, a, b |
XR[d] ← 240 ∥ (XR[a]63..0 >>s XR[b]5..0) XV[d] ← XV[a] & XV[b] |
Address calculation with index shift: A | ||
A | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], XR[b]<<sa) = 0 AR[d] ← ea AV[d] ← 1 endif |
Non-speculative tagged word loads with indexed addressing: L{A,X,S} | ||
LA | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], XR[b]<<sa) = 0 AR[d] ← sizedecode(lvload72(ea)) AV[d] ← 1 endif |
LX | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then XR[d] ← 0 XV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], XR[b]<<sa) = 0 XR[d] ← lvload72(ea) XV[d] ← 1 endif |
LS | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then SR[d] ← 0 SV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], XR[b]<<sa) = 0 SR[d] ← lvload72(ea) SV[d] ← 1 endif |
Non-speculative doubleword loads with indexed addressing: LAD (save/restore) and LAC (CHERI) | ||
LAD | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], XR[b]<<sa) = 0 AR[d] ← lvload144(ea) AV[d] ← 1 endif |
LAC | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], XR[b]<<sa) = 0 t ← lvload144(ea) trap if t71..67 ≠ 25 trap if t143..136 ≠ 252 AR[d] ← t AV[d] ← 1 endif |
Non-speculative segment relative loads with indexed addressing: RLA{64,32} | ||
RLA64 | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if boundscheck(AR[a], XR[b]<<sa) = 0 t ← lvload64(ea) AR[d] ← segrelative(AR[a], t) AV[d] ← 1 endif |
RLA32 | d, a, b, sa |
v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if boundscheck(AR[a], XR[b]<<sa) = 0 t ← lvload32(ea) AR[d] ← segrelative(AR[a], 032 ∥ t) AV[d] ← 1 endif |
Non-speculative sub-word loads with indexed addressing: L{A,X,S}{8,16,32,64}{U,S} | ||
LX64 | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload64(AR[a] +p XR[b]<<sa) : 0 XR[d] ← 240 ∥ t XV[d] ← v |
LS64 | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload64(AR[a] +p XR[b]<<sa) : 0 SR[d] ← 240 ∥ t SV[d] ← v |
LX32U | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0 XR[d] ← 240 ∥ 032 ∥ t XV[d] ← v |
LS32U | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0 SR[d] ← 240 ∥ 032 ∥ t SV[d] ← v |
LX32S | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0 XR[d] ← 240 ∥ t3132 ∥ t XV[d] ← v |
LS32S | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0 SR[d] ← 240 ∥ t3132 ∥ t SV[d] ← v |
LX16U | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0 XR[d] ← 240 ∥ 048 ∥ t XV[d] ← v |
LS16U | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0 SR[d] ← 240 ∥ 048 ∥ t SV[d] ← v |
LX16S | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0 XR[d] ← 240 ∥ t1548 ∥ t XV[d] ← v |
LS16S | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0 SR[d] ← 240 ∥ t1548 ∥ t SV[d] ← v |
LX8U | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← v ? lvload8(AR[a] +p XR[b]) : 0 XR[d] ← 240 ∥ 056 ∥ t XV[d] ← v |
LS8U | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← v ? lvload8(AR[a] +p XR[b]) : 0 SR[d] ← 240 ∥ 056 ∥ t SV[d] ← v |
LX8S | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← lvload8(AR[a] +p XR[b]) : 0 XR[d] ← 240 ∥ t756 ∥ t XV[d] ← v |
LS8S | d, a, b, sa |
v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← lvload8(AR[a] +p XR[b]) SR[d] ← 240 ∥ t756 ∥ t SV[d] ← v |
Load vector mask instructions with indexed addressing | ||
LM | d, a, b, sa |
v ← AV[a] & XV[b] trap if v = 0 ea ← AR[a] +p XR[b]<<sa trap if boundscheck(AR[a], XR[b]<<sa) = 0 VM[d] ← lvload128(ea) |
Speculative tagged word loads with indexed addressing: SL{A,X,S} | ||
SLA | d, a, b, sa |
v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa) if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 AR[d] ← sizedecode(lvload72(ea)) AV[d] ← 1 endif |
SLX | d, a, b, sa |
v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa) if v = 0 then XR[d] ← 0 XV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 XR[d] ← lvload72(ea) XV[d] ← 1 endif |
SLS | d, a, b, sa |
v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa) if v = 0 then SR[d] ← 0 SV[d] ← 0 else ea ← AR[a] +p XR[b]<<sa trap if ea2..0 ≠ 03 SR[d] ← lvload72(ea) SV[d] ← 1 endif |
Speculative sub-word loads with indexed addressing: SL{X,S}{8,16,32,64}{U,S} (TBD) |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | i | op32c | i | a | d | op32 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
Index arithmetic immediate | ||
ADDXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 + imm121152∥imm12) XV[d] ← XV[a] |
ANDXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 & imm121152∥imm12) XV[d] ← XV[a] |
MINUXI | d, a, imm |
XR[d] ← 240 ∥ minu(XR[a]63..0, imm121152∥imm12) XV[d] ← XV[a] |
MINSXI | d, a, imm |
XR[d] ← 240 ∥ mins(XR[a]63..0, imm121152∥imm12) XV[d] ← XV[a] |
MAXUXI | d, a, imm |
XR[d] ← 240 ∥ maxu(XR[a]63..0, imm121152∥imm12) XV[d] ← XV[a] |
MAXSXI | d, a, imm |
XR[d] ← 240 ∥ maxs(XR[a]63..0, imm121152∥imm12) XV[d] ← XV[a] |
RSUBXI | d, imm, a |
XR[d] ← 240 ∥ ((imm121152∥imm12) − XR[a]63..0) XV[d] ← XV[a] |
RSUBI | d, imm, a |
SR[d] ← 240 ∥ ((imm121152∥imm12) − SR[a]63..0) SV[d] ← SV[a] |
Index logical immediate | ||
ORXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 | imm121152∥imm12) XV[d] ← XV[a] |
XORXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 ^ imm121152∥imm12) XV[d] ← XV[a] |
Scalar integer logical immediate | ||
loSI | d, b, imm |
SR[d] ← 240 ∥ (SR[a]63..0 lo imm121152∥imm12) SV[d] ← SV[a] |
Scalar integer arithmetic immediate | ||
ioSI | d, b, imm |
SR[d] ← 240 ∥ (SR[a]63..0 io imm121152∥imm12) SV[d] ← SV[a] |
ADDSI | d, b, imm |
SR[d] ← 240 ∥ (SR[a]63..0 + imm121152∥imm12) SV[d] ← SV[a] |
SUBSI | d, b, imm |
SR[d] ← 240 ∥ (SR[a]63..0 − imm121152∥imm12) SV[d] ← SV[a] |
MINUSI | d, b, imm |
SR[d] ← 240 ∥ minu(SR[a]63..0, imm121152∥imm12) SV[d] ← SV[a] |
MINSSI | d, b, imm |
SR[d] ← 240 ∥ mins(SR[a]63..0, imm121152∥imm12) SV[d] ← SV[a] |
MINUSSI | d, b, imm |
SR[d] ← 240 ∥ minus(SR[a]63..0, imm121152∥imm12) SV[d] ← SV[a] |
MAXUSI | d, b, imm |
SR[d] ← 240 ∥ maxu(SR[a]63..0, imm121152∥imm12) SV[d] ← SV[a] |
MAXSSI | d, b, imm |
SR[d] ← 240 ∥ maxs(SR[a]63..0, imm121152∥imm12) SV[d] ← SV[a] |
MAXUSSI | d, b, imm |
SR[d] ← 240 ∥ maxus(SR[a]63..0, imm121152∥imm12) SV[d] ← SV[a] |
Non-speculative tagged word load/store with immediate addressing: L{A,X,S}I | ||
AI | d, a, imm |
if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else trap if boundscheck(AR[a], imm12) = 0 AR[d] ← AR[a] +p imm12 AV[d] ← 1 endif |
LAI | d, a, imm |
if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p imm12 trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], imm12) = 0 AR[d] ← sizedecode(lvload72(ea)) AV[d] ← 1 endif |
LXI | d, a, imm |
if AV[a] = 0 then XR[d] ← 0 XV[d] ← 0 else ea ← AR[a] +p imm12 trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], imm12) = 0 XR[d] ← lvload72(ea) XV[d] ← 1 endif |
LSI | d, a, imm |
if AV[a] = 0 then SR[d] ← 0 SV[d] ← 0 else ea ← AR[a] +p imm12 trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], imm12) = 0 SR[d] ← lvload72(ea) SV[d] ← 1 endif |
Non-speculative doubleword loads with indexed addressing: LADI (save/restore) and LACI (CHERI) | ||
LADI | d, a, imm |
if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p imm12 trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], imm12) = 0 AR[d] ← lvload144(ea) AV[d] ← 1 endif |
LACI | d, a, imm |
if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +p imm12 trap if ea2..0 ≠ 03 trap if boundscheck(AR[a], imm12) = 0 t ← lvload144(ea) trap if t71..67 ≠ 25 trap if t143..136 ≠ 252 AR[d] ← t AV[d] ← 1 endif |
Non-speculative sub-word load/store with immediate addressing: L{X,S}{8,16,32,64}{U,S}I | ||
LX64I | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload64(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ t XR[d] ← v |
LX32UI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ 032 ∥ t XR[d] ← v |
LS32UI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +p imm12) : 0 SR[d] ← 240 ∥ 032 ∥ t SV[d] ← v |
LX32SI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ t3132 ∥ t XV[d] ← v |
LS32SI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +p imm12) : 0 SR[d] ← 240 ∥ t3132 ∥ t SV[d] ← v |
LX16UI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ 048 ∥ t XV[d] ← v |
LS16UI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +p imm12) : 0 SR[d] ← 240 ∥ 048 ∥ t SV[d] ← v |
LX16SI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ t1548 ∥ t XV[d] ← v |
LS16SI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +p imm12) : 0 SR[d] ← 240 ∥ t1548 ∥ t SV[d] ← v |
LX8UI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ 056 ∥ t XV[d] ← v |
LS8UI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +p imm12) : 0 SR[d] ← 240 ∥ 056 ∥ t SV[d] ← v |
LX8SI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +p imm12) : 0 XR[d] ← 240 ∥ t756 ∥ t XV[d] ← v |
LS8SI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +p imm12) : 0 SR[d] ← 240 ∥ t756 ∥ t SV[d] ← v |
Speculative word load/store with immediate addressing: L{A,X,S}I | ||
SLAI | d, a, imm |
v ← AV[a] & boundscheck(AR[a], imm12) AR[d] ← v ? lvload72(AR[a] +p imm12) : 0 AV[d] ← v |
SLXI | d, a, imm |
v ← AV[a] & boundscheck(AR[a], imm12) XR[d] ← v ? lvload72(AR[a] +p imm12) : 0 XV[d] ← v |
SLSI | d, a, imm |
v ← AV[a] & boundscheck(AR[a], imm12) SR[d] ← v ? lvload72(AR[a] +p imm12) : 0 SV[d] ← v |
Instructions for loop iteration count prediction | ||
LOOPXI | d |
trap if XV[a] = 0 XR[d] ← XR[a] + imm121152∥imm12 XV[d] ← 1 |
31 | 28 | 27 | 22 | 21 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | op32f | op32c | op32b | a | d | op32 | ||||||||||||||
4 | 6 | 6 | 4 | 4 | 4 | 4 |
Instructions for save/restore | ||
MOVSB | d, a |
SR[d] ← 240 ∥ 063 ∥ BR[a] SV[d] ← BV[a] |
MOVBS | d, a, imm6 |
BR[d] ← SR[a]imm6 BV[d] ← SV[a] |
MOVSBALL | d |
SR[d] ← 240 ∥ 032 ∥ BV[15]∥BV[14]∥…∥BV[1]∥1 ∥ BR[15]∥BR[14]∥…∥BR[1]∥0 SV[d] ← 1 |
MOVBALLS | d |
BR[1] ← SR[a]1 BR[2] ← SR[a]2 ︙ BR[15] ← SR[a]15 BV[1] ← SR[a]17 BV[2] ← SR[a]18 ︙ BV[15] ← SR[a]31 |
MOVXAVALL | d |
XR[d] ← 240 ∥ 048 ∥ AV[15]∥AV[14]∥…∥AV[1]∥AV[0] XV[d] ← 1 |
MOVAVALLX | d |
AV[1] ← XR[a]1 AV[2] ← XR[a]2 ︙ AV[15] ← XR[a]15 |
MOVXXVALL | d |
XR[d] ← 240 ∥ 048 ∥ XV[15]∥XV[14]∥…∥XV[1]∥XV[0] XV[d] ← 1 |
MOVXVALLX | d |
XV[1] ← XR[a]1 XV[2] ← XR[a]2 ︙ XV[15] ← XR[a]15 |
MOVSM | d, m, w |
SR[d] ← 240 ∥ VM[a]w×64+63..w×64 SV[d] ← 1 |
MOVMS | d, a, w |
trap if SV[a] = 0 VM[d]w×64+63..w×64 ← SR[a] |
31 | 28 | 27 | 22 | 21 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | op32f | imm6 | b | a | d | op32 | ||||||||||||||
4 | 6 | 6 | 4 | 4 | 4 | 4 |
FUNSI | d, a, b, i |
t ← (SR[b]63..0∥SR[a]63..0) >> imm6 SR[d] ← 240 ∥ t63..0 SV[d] ← SV[a] & SV[b] |
ROTRSI | d, a, i | assembler expands to FUNSI d, a, a, i |
ROTLSI | d, a, i | assembler expands to FUNSI d, a, a, (−i)5..0 |
SLLXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 <<u imm6) XV[d] ← XV[a] |
SRLXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 >>u imm6) XV[d] ← XV[a] |
SRAXI | d, a, imm |
XR[d] ← 240 ∥ (XR[a]63..0 >>s imm6) XV[d] ← XV[a] |
31 | 28 | 27 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | op32f | m | c | b | a | op32d | op32 | |||||||||||||||
4 | 6 | 2 | 4 | 4 | 4 | 4 | 4 |
Store address instructions with indexed addressing | ||
SA | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 trap if (AR[a]2..0 + XR[b]2..0) ≠ 03 lvstore72(AR[a] +p XR[b]<<sa) ← AR2mem72(AR[c]) |
SAD | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 trap if (AR[a]3..0 + XR[b]3..0) ≠ 04 lvstore144(AR[a] +p XR[b]<<sa) ← AR2mem144(AR[c]) |
SAC | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 trap if (AR[a]3..0 + XR[b]3..0) ≠ 04 lvstore144(AR[a] +p XR[b]<<sa) ← AR2CHERImem144(AR[c]) |
Store index instructions with indexed addressing | ||
SX | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore72(AR[a] +p XR[b]<<sa) ← XR[c] |
SX64 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore64(AR[a] +p XR[b]<<sa) ← XR[c]63..0 |
SX32 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore32(AR[a] +p XR[b]<<sa) ← XR[c]31..0 |
SX16 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore16(AR[a] +p XR[b]<<sa) ← XR[c]15..0 |
SX8 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore8(AR[a] +p XR[b]<<sa) ← XR[c]7..0 |
Store scalar instructions with indexed addressing | ||
SS | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore72(AR[a] +p XR[b]<<sa) ← SR[c] |
SS64 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore64(AR[a] +p XR[b]<<sa) ← SR[c]63..0 |
SS32 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore32(AR[a] +p XR[b]<<sa) ← SR[c]31..0 |
SS16 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore16(AR[a] +p XR[b]<<sa) ← SR[c]15..0 |
SS8 | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore8(AR[a] +p XR[b]<<sa) ← SR[c]7..0 |
Store vector mask instructions with indexed addressing | ||
SM | c, a, b, sa |
trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore128(AR[a] +p XR[b]<<sa) ← VM[c] |
Branch instructions | ||
Bboba | c, a, b | branch if BR[c] ba (BR[a] bo BR[b]) |
BbaEQA | c, a, b | branch if BR[c] ba (AR[a] = AR[b]) |
BbaEQX | c, a, b | branch if BR[c] ba (XR[a] = XR[b]) |
BbaNEA | c, a, b | branch if BR[c] ba (AR[a] ≠ AR[b]) |
BbaNEX | c, a, b | branch if BR[c] ba (XR[a] ≠ XR[b]) |
BbaLTAU | c, a, b | branch if BR[c] ba (AR[a] <u AR[b]) |
BbaLTXU | c, a, b | branch if BR[c] ba (XR[a] <u XR[b]) |
BbaGEAU | c, a, b | branch if BR[c] ba (AR[a] ≥u AR[b]) |
BbaGEXU | c, a, b | branch if BR[c] ba (XR[a] ≥u XR[b]) |
BbaLTX | c, a, b | branch if BR[c] ba (XR[a] <s XR[b]) |
BbaGEX | c, a, b | branch if BR[c] ba (XR[a] ≥s XR[b]) |
BbaNONEX | c, a, b | branch if BR[c] ba ((XR[a] & XR[b]) = 0) |
BbaANYX | c, a, b | branch if BR[c] ba ((XR[a] & XR[b]) ≠ 0) |
assembler simplified versions of the above | ||
Bbo | a, b | equivalent to BORbo b0, a, b |
BEQA | a, b | equivalent to BOREQA b0, a, b |
BEQX | a, b | branch if XR[a] = XR[b] |
BNEA | a, b | branch if AR[a] ≠ AR[b] |
BNEX | a, b | branch if XR[a] ≠ XR[b] |
BLTAU | a, b | branch if AR[a] <u AR[b] |
BLTXU | a, b | branch if XR[a] <u XR[b] |
BGEAU | a, b | branch if AR[a] ≥u AR[b] |
BGEXU | a, b | branch if XR[a] ≥u XR[b] |
BLTX | a, b | branch if XR[a] <s XR[b] |
BGEX | a, b | branch if XR[a] ≥s XR[b] |
BNONEX | a, b | branch if (XR[a] & XR[b]) = 0 |
BANYX | a, b | branch if (XR[a] & XR[b]) ≠ 0 |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op32g | i | c | i | a | op32d | op32 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
Store address instructions with immediate addressing | ||
SAI | c, a, imm | lvstore72(AR[a] +p imm12) ← AR[c] |
SADI | c, a, imm | lvstore144(AR[a] +p imm12) ← AR[c] |
Store index instructions with immediate addressing | ||
SXI | c, a, imm | lvstore72(AR[a] +p imm12) ← XR[c] |
SX64I | c, a, imm | lvstore64(AR[a] +p imm12) ← AR[c]63..0 |
SX32I | c, a, imm | lvstore32(AR[a] +p imm12) ← AR[c]31..0 |
SX16I | c, a, imm | lvstore16(AR[a] +p imm12) ← AR[c]15..0 |
SX8I | c, a, imm | lvstore8(AR[a] +p imm12) ← AR[c]7..0 |
Store scalar instructions with immediate addressing | ||
SSI | c, a, imm | lvstore72(AR[a] +p imm12) ← SR[c] |
SS64I | c, a, imm | lvstore64(AR[a] +p imm12) ← SR[c]63..0 |
SS32I | c, a, imm | lvstore32(AR[a] +p imm12) ← SR[c]31..0 |
SS16I | c, a, imm | lvstore16(AR[a] +p imm12) ← SR[c]15..0 |
SS8I | c, a, imm | lvstore8(AR[a] +p imm12) ← SR[c]7..0 |
Branch instructions with immediate comparison | ||
BbaEQXI | c, b, imm12 | branch if BR[c] ba (XR[a] = imm12) |
BbaNEXI | c, a, imm12 | branch if BR[c] ba (XR[a] ≠ imm12) |
BbaLTUXI | c, a, imm12 | branch if BR[c] ba (XR[a] <u imm12) |
BbaGEUXI | c, a, imm12 | branch if BR[c] ba (XR[a] ≥u imm12) |
BbaLTXI | c, a, imm12 | branch if BR[c] ba (XR[a] <s imm12) |
BbaGEXI | c, a, imm12 | branch if BR[c] ba (XR[a] ≥s imm12) |
BbaNONEXI | c, a, imm12 | branch if BR[c] ba ((XR[a] & imm12) = 0) |
BbaANYXI | c, a, imm12 | branch if BR[c] ba ((XR[a] & imm12) ≠ 0) |
assembler simplified versions of the above | ||
BEQXI | a, imm | equivalent to BOREQXI b0, a, imm |
BNEXI | a, imm | equivalent to BORNEXI b0, a, imm |
BLTUXI | a, imm | equivalent to BORLTUXI b0, a, imm |
BGEUXI | a, imm | equivalent to BORGEUXI b0, a, imm |
BLTXI | a, imm | equivalent to BORLTXI b0, a, imm |
BGEXI | a, imm | equivalent to BORGEXI b0, a, imm |
BNONEXI | a, imm | equivalent to BORNONEXI b0, a, imm |
BANYXI | a, imm | equivalent to BORANYXI b0, a, imm |
Instructions yet to be grouped | ||
SWITCHI | a, imm | PC ← AR[a] +p imm12 |
LJMPI | a, imm | PC ← lvload72(AR[a] +p imm12) |
LJMP | a, b | PC ← lvload72(AR[a] +p XR[b]<<sa) |
FENCE | This is a placeholder for various FENCE instructions that need to be defined. | |
WFI | Wait For Interrupt for the current ring. May be intercepted by more privileged rings. | |
HALT | The processor finishes all outstanding operands and halts. It will only be woken by Soft Reset. Ring 7 only. | |
BREAK | This is a placeholder for later definition. | |
ILL | This is a placeholder for later definition. | |
CSR* | This is a placeholder for later definition. | |
fmtCLASSS | This is a placeholder for later definition. |
31 | 28 | 27 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||
op32g | imm16 | a | d | op32 | ||||||||||
4 | 16 | 4 | 4 | 4 |
AI | d, a, imm |
v ← AV[a] trap if v & boundscheck(AR[a], imm16) = 0 AR[d] ← AR[a] +p imm16 AV[d] ← v |
Stack frame allocation for upward and downward stacks | ||
ENTRY | d, a, imm8 |
trap if imm8 ≥ 192 osp ← AR[a] oring ← osp135..133 osize ← 03 ∥ osp132..72 oaddr ← osp63..0 naddr ← oaddr + osize e ← imm87..4 nsize ← e = 0 ? 054∥imm83..0∥03 : 053−s∥1∥imm83..0∥0e∥02 nring ← min(PC.ring, oring) ssize ← segsize(oaddr) trap if naddr63..ssize ≠ oaddr63..ssize nsp ← 251 ∥ nring ∥ nsize ∥ imm8 ∥ naddr lvstore72(nsp) ← osp71..0 AR[d] ← nsp |
ENTRYD | d, a, imm8 |
trap if imm8 ≥ 192 osp ← AR[a] oring ← osp135..133 oaddr ← osp63..0 e ← imm87..4 nsize ← e = 0 ? 054∥imm83..0∥03 : 053−s∥1∥imm83..0∥0e∥02 naddr ← oaddr − nsize nring ← min(PC.ring, oring) ssize ← segsize(oaddr) trap if naddr63..ssize ≠ oaddr63..ssize nsp ← 251 ∥ nring ∥ nsize ∥ imm8 ∥ naddr lvstore72(nsp) ← osp71..0 AR[d] ← nsp |
31 | 8 | 7 | 4 | 3 | 0 | |||
imm24 | d | op32 | ||||||
24 | 4 | 4 |
XI | d, imm |
XR[d] ← 240 ∥ imm242340∥imm24 XV[d] ← 1 |
XUI | d, imm |
XR[d] ← 240 ∥ imm24∥040 XV[d] ← 1 |
SI | d, imm |
SR[d] ← 240 ∥ imm242340∥imm24 SV[d] ← 1 |
SUI | d, imm |
SR[d] ← 240 ∥ imm24∥040 SV[d] ← 1 |
DUI | d, imm |
SR[d] ← 244 ∥ imm24∥040 SV[d] ← 1 |
FI | d, imm |
SR[d] ← 245 ∥ 032∥imm24∥08 SV[d] ← 1 |
1:0 3:2 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 3XI | 3XUI | 3SI | 3SUI |
1 | 3ADDXI | 3ADDXUI | 3ADDSI | 3ADDSUI |
2 | 3ANDXI | 3ANDUXI | i48v | |
3 | 3ORXI | 3ORXUI | 3FI | 3DUI |
47 | 24 | 23 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op48dabcm | e | c | b | a | d | op48 | ||||||||||||||
24 | 4 | 4 | 4 | 4 | 4 | 4 |
Vector-Vector Integer | ||
3ioiaVV | d, c, a, b |
VR[d] ← VR[c] ia (VR[a] io VR[b]) masked by VM[e] |
3lolaVV | d, c, a, b |
VR[d] ← VR[c] la (VR[a] lo VR[b]) masked by VM[e] |
3SELVV | d, c, a, b |
VR[d] ← select(VM[c], VR[a], VR[b]) masked by VM[e] |
3i1S | d, a |
VR[d] ← i1(VR[a]) masked by VM[e] |
Vector-Scalar Integer | ||
3ioiaVS | d, c, a, b |
VR[d] ← VR[c] ia (VR[a] io SR[b]) masked by VM[e] |
3lolaVS | d, c, a, b |
VR[d] ← VR[c] la (VR[a] lo SR[b]) masked by VM[e] |
3SELVS | d, c, a, b |
VR[d] ← select(VM[c], VR[a], SR[b]) masked by VM[e] |
Vector-Immediate Integer | ||
3ioiaVI | d, c, a, imm |
VR[d] ← VR[c] ia (VR[a] io imm) masked by VM[e] |
3lolaVI | d, c, a, imm |
VR[d] ← VR[c] la (VR[a] lo imm) masked by VM[e] |
3SELVI | d, c, a, imm |
VR[d] ← select(VM[c], VR[a], imm) masked by VM[e] |
Vector-Vector integer comparison | ||
3icbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] ic VR[b]) |
Vector-Scalar integer comparison | ||
3icbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] ic SR[b]) |
Vector-Immediate integer comparison | ||
3icbaVI | d, c, a, imm | VM[d] ← VM[c] ba (VR[a] ic imm12) |
Vector-Vector Floating-Point | ||
3DfofaVV | d, c, a, b, e |
VR[d] ← VR[c] fad (VR[a] fod VR[b]) masked by VM[e] |
3FfofaVV | d, c, a, b, e |
VR[d] ← VR[c] faf (VR[a] fof VR[b]) masked by VM[e] |
3HfofaVV | d, c, a, b, e |
VR[d] ← VR[c] fas (VR[a] foh VR[b]) masked by VM[e] |
3BfofaVV | d, c, a, b, e |
VR[d] ← VR[c] fas (VR[a] fob VR[b]) masked by VM[e] |
3P4fofaVV | d, c, a, b, e |
VR[d] ← VR[c] fab (VR[a] fop4 VR[b]) masked by VM[e] |
3P3fofaVV | d, c, a, b, e |
VR[d] ← VR[c] fab (VR[a] fop3 VR[b]) masked by VM[e] |
Vector-Scalar Floating-Point | ||
3DfofaVS | d, c, a, b, e |
VR[d] ← VR[c] fad (VR[a] fod SR[b]) masked by VM[e] |
3FfofaVS | d, c, a, b, e |
VR[d] ← VR[c] faf (VR[a] fof SR[b]) masked by VM[e] |
3HfofaVS | d, c, a, b, e |
VR[d] ← VR[c] fah (VR[a] foh SR[b]) masked by VM[e] |
3BfofaVS | d, c, a, b, e |
VR[d] ← VR[c] fas (VR[a] fob SR[b]) masked by VM[e] |
3P4fofaVS | d, c, a, b, e |
VR[d] ← VR[c] fab (VR[a] fop4 SR[b]) masked by VM[e] |
3P3fofaVS | d, c, a, b, e |
VR[d] ← VR[c] fab (VR[a] fop3 SR[b]) masked by VM[e] |
Vector-Vector floating comparison | ||
3DfcbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcd VR[b]) |
3FfcbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcf VR[b]) |
3HfcbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fch VR[b]) |
3BfcbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcb VR[b]) |
3P4fcbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcp4 VR[b]) |
3P3fcbaVV | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcp3 VR[b]) |
Vector-Scalar floating comparison | ||
3DfcbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcd SR[b]) |
3SfcbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcs SR[b]) |
3HfcbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fch SR[b]) |
3BfcbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcb SR[b]) |
3P4fcbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcp4 SR[b]) |
3P3fcbaVS | d, c, a, b | VM[d] ← VM[c] ba (VR[a] fcp3 SR[b]) |
47 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
imm36 | a | d | op48 | ||||||||
36 | 4 | 4 | 4 |
3ADDXI | d, a, imm |
XR[d] ← XR[a] + imm363528∥imm36 XV[d] ← XV[a] |
3ADDXUI | d, a, imm |
XR[d] ← XR[a] + imm36∥028 XV[d] ← XV[a] |
3ANDSI | d, a, imm |
SR[d] ← SR[a] & imm363528∥imm36 SV[d] ← SV[a] |
3ANDSUI | d, a, imm |
SR[d] ← SR[a] & imm36∥028 SV[d] ← SV[a] |
3ORSI | d, a, imm |
SR[d] ← SR[a] | imm363528∥imm36 SV[d] ← SV[a] |
3ORSUI | d, a, imm |
SR[d] ← SR[a] | imm36∥028 SV[d] ← SV[a] |
3ADDDUI | d, a, imm |
SR[d] ← SR[a] +d imm36∥028 SV[d] ← SV[a] |
47 | 8 | 7 | 4 | 3 | 0 | |||
imm40 | d | op48 | ||||||
40 | 4 | 4 |
3XI | d, imm |
XR[d] ← 240 ∥ imm403924∥imm40 XV[d] ← 1 |
3XUI | d, imm |
XR[d] ← 240 ∥ imm40∥024 XV[d] ← 1 |
3SI | d, imm |
SR[d] ← 240 ∥ imm403924∥imm40 SV[d] ← 1 |
3SUI | d, imm |
SR[d] ← 240 ∥ imm40∥024 SV[d] ← 1 |
3DUI | d, imm |
SR[d] ← 244 ∥ imm40∥024 SV[d] ← 1 |
3FI | d, imm |
SR[d] ← 245 ∥ 024∥imm40 SV[d] ← 1 |
1:0 3:2 |
0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 4I | 4UI | 4SI | 4SUI |
1 | ||||
2 | ||||
3 | 4FI | 4DUI |
63 | 8 | 7 | 4 | 3 | 0 | |||
imm56 | d | op64 | ||||||
56 | 4 | 4 |
4XI | d, imm |
XR[d] ← 240 ∥ imm56558∥imm56 XV[d] ← 1 |
4XUI | d, imm |
XR[d] ← 240 ∥ imm56∥08 XV[d] ← 1 |
4SI | d, imm |
SR[d] ← 240 ∥ imm56558∥imm56 SV[d] ← 1 |
4SUI | d, imm |
SR[d] ← 240 ∥ imm56∥08 SV[d] ← 1 |
4DUI | d, imm |
SR[d] ← 244 ∥ imm56∥08 SV[d] ← 1 |
4FI | d, imm |
SR[d] ← 245 ∥ 08∥imm56 SV[d] ← 1 |
63 | 24 | 23 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op64dabcm | e | c | b | a | d | op64 | ||||||||||||||
40 | 4 | 4 | 4 | 4 | 4 | 4 |
I expect SecureRISC software to use the ILP64 model, where integers and pointers are both 64 bits. Even in the 1980s when MIPS was defining its 64‑bit ISA, I argued that integers should be 64 bits, but keeping integers 32 bits for C was considered sacred by others. The result is that an integer cannot index a large array, which is terrible. With ILP64, I don’t expect SecureRISC to need special 32‑bit add instructions (that sign-extend from bit 31 to bits 63..32).
Translation is a two-stage process, where in the first stage a Local Virtual Address (lvaddr) is translated to a System Virtual Address (svaddr), and then in the second stage that address is then translated to a System Interconnect Address (siaddr). The lvaddr→svaddr translation may involve multiple svaddr reads, each of which has to also be translated to a siaddr during the process. A full translation is therefore very costly and is typically cached as direct lvaddr→siaddr to make the process much faster after the first time. The following sections first describe the lvaddr→svaddr process, and then subsequent sections describe the svaddr→siaddr process. These translations are somewhat similar, with minor differences. Once the first stage lvaddr→svaddr process is understood, the second stage svaddr→siaddr process will be straight-forward. Some systems may set up a minimal second stage translation process, but the process is still important for determining the cache and memory attributes and permissions, as stored in Region Descriptor Table (RDT).
Translation of local virtual addresses (lvaddrs) to system interconnect addresses (siaddrs) is typically performed in a single processor cycle in one of several L1 translation caches (often called TLBs), which may be supplemented with one or more L2 TLBs. If the TLBs fail to provide translate the address, then the processor performs a lengthier procedure, and if that succeeds, then the result is written into the TLBs to speed later translations. This TLB miss procedure determines the memory architecture. As described earlier, SecureRISC uses both segmentation and paging in its memory architecture. The first step of a TLB miss is therefore to determine a segment descriptor and then proceed as that directs. One way of thinking about SecureRISC segmentation is that is a specialized first-level page table that controls the subsequent levels, including giving the page size and table depth (derived from the segment size). After the segment descriptor, 0 to 4 levels of page table walk are used to complete the translation, as depending on the table values set by the supervisor. While 4‑level page tables are supported, SecureRISC is designed to avoid this if the operating system can use its features, as multiple-level page tables needlessly increase the TLB miss penalty.
SecureRISC segments may be directly mapped to an aligned system virtual address range equal to the segment size, or they may be paged. Direct mapping may be appropriate to I/O regions, for example. It consists of simply changing the high bits (above the segment size) of the local virtual address to the appropriate system virtual address bits and leaving the low bits (below the segment size) unchanged.
Processors today all implement some form of paging in their virtual address translation. Paging exists for several reasons. The most critical today is to simplify memory allocation in the operating system, as without paging, it would be necessary to find contiguous regions of memory to assign to address spaces. A secondary purpose is to allow a larger virtual address space than physical memory, which performs reasonably if the working set of the process fits in the physical memory (i.e. it does not use all of its virtual memory all the time).
A critical processor design decision is the choice of a page size or page sizes. If minimizing memory overhead is the criteria, it is well known that the optimal page size for an area of virtual memory is proportional to the square root of that memory size. Back in the 1960s, 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size back to minimize the memory wasted by allocating in page units and the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in 2020s from the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty is then a factor in the page size consideration that did not exist in the 1960s.
In addition, today regions of memory vary wildly in size in computer systems, with many processes having small code regions, a small stack region, and a heap that may be small, large, or huge, and sometimes the size is dependent upon input parameters. Even in processors that support multiple page sizes, size is often set for the entire system. When page size is variable at runtime, there may be only one value for the entire process virtual address space, which makes the value for setting be sub-optimal for code, stack, or heap, depending on which is chosen for optimization. Further, memory overhead is not the only criteria of importance. Larger page sizes minimize translation cache misses and therefore improve performance at the cost of memory wastage. Larger page sizes may also reduce the translation cache miss penalty when multi-level page tables are used (as is common today), by potentially reducing the number of levels to be read on a miss.
A major advantage of segmentation is that it becomes possible to choose different page sizes on a per segment basis. Each shared library and the main program are individual segments containing code, and each could have a page size appropriate to its size. The stack and heap segments can likewise have different page sizes from the code segments and each other. Choosing a page size based on the square root of the segment size not only minimizes memory wastage, but it can also keep the page table a single level, which minimizes the translation cache miss penalty.
There is a cost to implementing multiple page sizes in the operating system. Typically, free lists are maintained for each page size, and when a smaller page free list is empty, a large page is split up. The reverse process, of coalescing pages, is more involved, as it may be necessary to migrate one or more small pages to put back together what was split apart. This however has been implemented in operating systems and made to work well.
There is also a cost to implementing multiple page sizes in translation caches (typically called TLBs though that is a terrible name). The most efficient hardware for translation caches would prefer a single page size, or failing that, a fairly small number of page sizes. Page size flexibility can affect critical processor timing paths. Despite this, the trend has been toward supporting a small number of page sizes. The inclusion of a vector architecture helps to address this issue, as vector loads and stores are not as latency sensitive as scalar loads and stores, and therefore can go directly to a L2 translation cache, which is both larger, and as a result of being larger slower, and therefore better able to absorb the cost of multiple page size matching. Much of the need for larger sizes occurs in applications with huge memory needs, and these applications are often able to exploit the vector architecture.
It may help to consider what historical architectures have for page size options. According to Wikipedia other 64‑bit architectures have supported the following page sizes:
Architecture | 4 KiB | 8 KiB | 16 KiB | 64 KiB | 2 MiB | 1 GiB | Other |
---|---|---|---|---|---|---|---|
MIPS | ✔ | ✔ | ✔ | 256 KiB, 1 MiB, 4 MiB, 16 MiB | |||
x86-64 | ✔ | ✔ | ✔ | ||||
ARM | ✔ | ✔ | ✔ | ✔ | ✔ | 32 MiB, 512 MiB | |
RISC‑V | ✔ | ✔ | ✔ | 512 GiB, 256 TiB | |||
Power | ✔ | ✔ | 16 MiB, 16 GiB | ||||
UltraSPARC | ✔ | ✔ | 512 KiB, 4 MiB, 32 MiB, 256 MiB, 2 GiB, 16 GiB | ||||
IA-64 | ✔ | ✔ | ✔ | 256 KiB, 1 MiB, 4 MiB, 16 MiB, 256 MiB | |||
SecureRISC? | ✔ | ✔ | ✔ | 256 KiB |
The only very common page size is 4 KiB, with 64 KiB, 2 MiB, and 1 GiB being somewhat common second page sizes. I believe 4 KiB has been carried forward from the 1960s for compatibility reasons as there probably exists some application and device driver software where page size assumptions exist. It would be interesting to know how often UltraSPARC encountered porting problems with its 8 KiB minimum page size. Today 8 KiB or 16 KiB pages make more technical sense for a minimum page size, but application assumptions may suggest keeping the old 4 KiB minimum and introducing at least one more page size to reduce translation cache miss rates.
RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads (they are all derived from the 4 KiB page being used at each level of the table walk). These early outs do reduce translation cache miss penalties, but they do complicate TLB matching, as mentioned earlier. To RISC‑V’s credit, it introduced a new PTE format (under the Svnapot extension) that communicates to processors that can take advantage of it that groups of PTEs are consistent and can be implemented with a larger unit in the translation cache. SecureRISC will adopt this idea.
Even a large memory system (e.g. HPC) will have many small segments (e.g. code segments, small files, non-computational processes such as editors, command line interpreters, etc.), and a smaller page size, such as 8 KiB may be appropriate for these segments. However, 4 KiB is probably not so sub-optimal to warrant incompatibility by not supporting this size. Therefore, the question is what is the most appropriate other page size, or page sizes, besides 4 KiB (which supports up to 2 MiB with one level, and up to 1 GiB with two levels). If only one other page size were possible for all implementations, 256 KiB might be a good choice, since this supports segment sizes up to 233 bytes with one level, and segment sizes of 234 to 248 bytes with two levels. But not all implementations need to support physical memory appropriate to a ≥248‑byte working set.
Instead, it makes sense to choose a second page size in addition to the 4 KiB compatibility size to extend the range of 1 and 2‑level page tables, and then allow implementations targeted at huge physical memories to employ even larger page sizes. In particular, there is a 4 KiB page size intended for backward compatibility, but the suggested page size is 16 KiB. Sophisticated operating systems will primarily operate with a pool of 16 KiB pages, with a mechanism to split these into 4 KiB pages and coalesce these back for applications that require the smaller page size.
SecureRISC has three improvements on paging found in recent architectures. First, it takes advantage of segment sizes to reduce page table walk latency. Second, it allows the operating system to specify the sizes of tables used at each level of the page table walk, rather than tying this to the page size used in translation caches. Decoupling the non-leaf table sizes from the leaf page sizes provides a mechanism that sophisticated operating systems may use for better performance, and on such systems this reduces some of the pressure for larger page sizes. Large leaf page sizes are still however useful for reducing TLB miss rates, and as the third improvement, SecureRISC borrows from RISC‑V and allows the operating system to indicate where larger pages can be exploited by translation caches to reduce miss rates, but without requiring that all implementations do so.
Paging in SecureRISC takes advantage of segment size field in Segment Descriptors to be more efficient than in some architectures. Even a simple operating system—one that specifies tables with the same size at every level—benefits when small segments need fewer levels of tables to cover the specified size specified in the Segment Descriptor. Just because the maximum segment size is 261 bytes doesn’t mean that every segment requires six levels of 4 KiB tables.
Segment descriptors and non-leaf page tables give the page size to be used at the next level, which allows the operating system to employ larger or smaller tables to optimize tradeoffs appropriate to the implementation and the application. Some implementations may add additional page sizes beyond these basic two in their translation cache matching hardware, such as 64 KiB and 256 KiB, some implementation targeting huge memory systems and applications (e.g. HPC) may add even larger pages to target reduced TLB miss rates. The paging architecture allows this flexibility with Page Table Size (PTS) encoding in segment descriptors and non-leaf PTEs, and for leaf PTEs by an encoding borrowed from RISC‑V called NAPOT that allows enabled translation caches to take advantage of multiple consistent page table entries.
As mentioned earlier, the page size that optimizes memory wastage for a single-level page table is proportional to the square root of the memory size, or in a segmented memory, to segment size, and a single-level page table also minimizes the TLB miss penalty, with a 2‑level page table being second best for TLB miss penalty. SecureRISC’s goal is to allow the operating system to choose page sizes per segment that keep the page tables to 1 or 2 levels. It is therefore interesting to consider what segment sizes are supported with this criterion with various page sizes. This is illustrated in the following table, assuming an 8 B PTE:
Page Size | 1-Level | 2-Level | 3-Level | Level | |||
---|---|---|---|---|---|---|---|
Last | Other | bits | |||||
4 KiB | 2 MiB | 1 GiB | 512 GiB | 21 | 30 | 39 | |
4 KiB | 16 KiB | 8 MiB | 16 GiB | 32 TiB | 23 | 34 | 45 |
16 KiB | 32 MiB | 64 GiB | 128 TiB | 25 | 36 | 47 | |
16 KiB | 64 KiB | 128 MiB | 1 TiB | 8 PiB | 27 | 40 | 53 |
64 KiB | 512 MiB | 4 TiB | 32 PiB | 29 | 42 | 55 | |
256 KiB | 8 GiB | 256 TiB | 8 EiB | 33 | 48 | 63 | |
2 MiB | 512 GiB | 128 PiB | 39 | 57 |
The other consideration for page size is covering matrix operations in the L1 TLB. Matrix algorithms typically operate on smaller sub-blocks of the matrices to maximize data reuse and to fit into either the more constraining of the L1 TLB and L2 data cache (with other larger blocking done to fit into the L2 TLB and L3, and smaller blocking to fit into the register file). Matrices are often large enough that each row is in a different page for small page sizes. For an algorithm with 8 B or 16 B per element, each row is in a different page at the following column dimension:
Page size |
Columns | ×1024 rows per page | ||
---|---|---|---|---|
8 B | 16 B | 8 B | 16 B | |
4 KiB | 512 | 256 | 0.5 | 0.25 |
8 KiB | 16 | 512 | 1 | 0.5 |
16 KiB | 2048 | 1024 | 2 | 1 |
64 KiB | 8192 | 4096 | 8 | 4 |
256 KiB | 32768 | 16384 | 32 | 16 |
For large computations (e.g. ≥1024 columns of 16 B elements), every a row increment is going to require a new TLB entry for page sizes ≤16 KiB. Even a 16 KiB page with 16 B per element results in a TLB entry per row. For a L1 TLB of 32 entries and three matrices (e.g. matrix multiply A = A + B × C), the blocking needs to limited to only 8 rows of each matrix (e.g. 8×8 blocking), which is on the low-side for the best performance. In contrast, the 64 KiB page size fits 4 rows in a single page, and so allows 32×32 blocking for three matrices using 24 entries.
If the vector unit is able to use the L2 TLB rather than the L1 TLB for its translation, which is plausible, then these larger page sizes are not quite as critical. A L2 TLB is likely to be 128 or 256 entries, and so able to hold 32 or 64 rows of ×1024 matrices of 16 B elements.
A possible goal for page size might be to balance the TLB and L2 cache sizes for matrix blocking. For example, a L2 cache size of 512 KiB can fit up to 100×100 blocks of three matrices of 16 B elements (total 480 KiB) given sufficient associativity. To fit 100 rows of 3 matrices in the L2 TLB requires ≥300 entries when pages are ≤16 KiB, but only ≥75 entries when pages ≥64 KiB. A given implementation should make similar tradeoffs based on the target applications and candidate TLB and cache sizes, and page size is another parameter that factors into the tradeoffs here. What is clear is that the architecture should allow implementations to efficiently support multiple page sizes if the translation cache timing allows it.
Because multiple page sizes do affect timing-critical paths in the translation caches, it is worth pointing out that implementations are able to reduce the page size stored in translation caches to equal the matching hardware. An implementation could for example synthesize 16 KiB pages for the translation cache even when the operating system specifies a 64 KiB page. This will however increase the miss rate. Conversely, some hardware may support an even larger set of page sizes. SecureRISC adopts the NAPOT encoding from RISC‑V’s PMPs and PTEs (with the Svnapot extension) to allow the TLB to use larger matching for groups of consistent PTEs without requiring it. Thus, it up to implementations whether to adopt larger page matching to lower the TLB miss rate at the cost of a potential TLB critical path. The cost of this feature is one bit in the PTE (taken from the bits reserved for software).
Should it become possible to eliminate the 4 KiB compatibility page size in favor of a 16 KiB minimum page size, it may be appropriate to use the extra two bits the increase the svaddr and siaddr widths from 64 to 66 bits.
TLBs introduce one other complication. Typically, when the supervisor switches from one process to another, it changes the segment and page tables. Absent an optimization, it would be necessary to flush the TLBs on any change in the tables, which is both costly in the cycles to flush and the misses that follow reloading the TLBs on memory references following the switch. Most processors with TLBs introduce a mechanism to reduce how often the TLB must be flushed, such as the Address Space Identifier (ASID) found in the MIPS translation hardware. The ASID is stored in the TLB, and when the supervisor switches to a new process, it either uses the process’ previous ASID, or assigns a new one if the TLB has been flushed since the last time the process ran. This allows its previous TLB entries to be used if they are still present in the TLB, but also avoids the TLB flush. When the ASIDs are used up, the TLB is flushed, and then ASID assignment starts fresh as processes are run. For example, a 5‑bit ASID would then require a TLB flush only when the 33rd distinct process is run after the last flush. The supervisor often uses translation and paging for its own data structures, some of which are process-specific, and some of which are common. To not require multiple TLB entries for the supervisor pages common between processes, a Global bit was introduced in the MIPS and other TLBs. This bit caused the TLB entry to ignore the ASID during the match process; such entries match any ASID. This whole issue occurs a second time when hypervisors switch between multiple guest operating systems, each of which thinks it controls the ASIDs in the TLB. RISC‑V for example introduced a VMID controlled by the hypervisor that works analogously to the ASID.
SecureRISC needs an ASID mechanism and a way to ignore for the same reason as in other ISAs. The question is whether this mechanism needs to be generalized, just as rings are a generalization of supervisor and user mode. I propose one such possible generalization with eight possible sharing opportunities, but whether this is required may be reevaluated. Perhaps SecureRISC will revert to a simple Global bit or just ASID=0 to mean Global. There is no particular reason to choose eight. Below is the mechanism proposed. Again, we expect that various service levels in the system will have some segments common to all of the service levels that they support, and that these should require only a single TLB entry, but that other segments might be change their translation for each supported service level.
The simplest implementation for a Segment Descriptor Table (SDT) is to have a single Segment Descriptor Table Pointer (SDTP) register and use a Global bit in Page Table Entries (PTEs). My alternative ASID generalization is to groups segments into eight groups (SG), and give each group its own SDT, as addressed by eight SDTP registers. These eight registers are then the zero-level table, followed by the chosen Segment Descriptor Table (the first level), followed by zero to four levels of page table. Since the registers are not in memory, there are one to five levels of memory tables to walk starting with the Segment Descriptor Table. The segment size in the SDT allows the length of the walk to be per segment, so most code segments (e.g. shared libraries) will have only one level of page table, but a process with a segment for weather data might require two or three levels (and might use a large page size as well to minimize TLB misses). Some hypervisor segments might be direct-mapped and require only the SDT level of mapping. In addition, if the hypervisor is not paging the supervisor, it might direct map many supervisor segments.
Here are the details. After a TLB miss, the processor starts by using the 3 high bits of the segment field of the address to pick one of eight Segment Descriptor Table registers (sdtp[0] to sdtp[7]). The low 13 bits of the segment field are then an index into the table at the system virtual address in the specified register. The SGS encoding of the sdtp registers is used to bounds check the low 13 bits of the segment number before indexing, which allows each portion of the Segment Descriptor Table to be 512, 1024, 2048, 4096, or 8192 entries (8 KiB to 128 KiB). The sgen register may be used to disable the segment group and all accesses to the group; otherwise, the check is that SGS = 4 | lvaddr60..57+SGS = 04−SGS. If the bound check succeeds, the doubleword Segment Descriptor Entry is read from (sdtp[lvaddr63..61]63..14 ∥ 014) | (svaddr60..48 ∥ 04) and this descriptor is used to bounds check the segment offset, and to generate a system virtual address. When TLB entries are created to speed future translations, they use the Address Space Identifier (ASID) specified in bits 11..0 the selected sdtp.
This method can be used to provide the functionality of two levels of other architectures (i.e. supervisor common using Global=1 and per process using Global=0). A SecureRISC supervisor might simply use segment group 7 (segments 57344–65535) for supervisor common (with ASID=0), and 256-1024 group 0 segments for per process mappings with dynamically assigned ASIDs as they are run. Such a system might set sdtp[7] at initialization, change sdtp[0] on process switch, and leave the other six groups unused (sgen ring fields set to 7).
Each Segment Descriptor Table Pointer register is only readable and writable by the ring specified in the corresponding field of sgen. Other rings must use calls to the appropriate ring to read and write these registers.
71 | 64 | 63 | 12 | 11 | 0 | ||||||
240 | svaddr63..13+SGS | 2SGS | ASID | ||||||||
8 | 51−SGS | 1+SGS | 12 |
Field | Width | Bits | Description |
---|---|---|---|
ASID | 12 | 11:0 | Temporary Address Space ID |
2SGS | 1+SGS | 12+SGS:12 | Encoding of Segment Group Size |
svaddr63..13+SGS | 51−SGS | 63:13+SGS | Pointer to Segment Descriptors for Segment Group |
This section is very preliminary at this point.
The sgen controls which rings can write the various sdtp registers by specifying a 3‑bit ring number per register. Reads or writes to sdtp[i] register or its shadows trap if the current mode or ring number is less than or equal to sgeni×8+2..i×8. It is possible to provide read access separate from write access, but I don’t see the need. Setting a ring number to 7 disables the corresponding sdtp register altogether. Ring numbers less than 3 would typically never be used.
63 | 56 | 55 | 48 | 47 | 40 | 39 | 32 | 31 | 24 | 23 | 16 | 15 | 8 | 7 | 0 | ||||||||
sg7 | sg6 | sg5 | sg4 | sg3 | sg2 | sg1 | sg0 | ||||||||||||||||
8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
7 | 6 | 3 | 2 | 0 | ||
L | F | ring | ||||
1 | 4 | 3 |
Field | Width | Bits | Description |
---|---|---|---|
ring | 3 | 2:0 | Ring for which sdtp is enabled |
F | 4 | 6:3 | Reserved for future use |
L | 1 | 7 | Lock |
The segment descriptor can be thought of the first-level page table, but with a 16 B descriptor instead of an 8 PTE. The first 8 B of the descriptor is made very similar to the PTE format, with the extra permissions, attributes, etc. in the second 8 B of the descriptor.
Possible future changes:
71 | 64 | 63 | 3 | 2 | 0 | ||||||
240 | svaddr63..4+PTS | 2PTS | MAP | ||||||||
8 | 60−PTS | 1+PTS | 3 |
71 | 64 | 63 | 40 | 39 | 32 | 31 | 30 | 29 | 28 | 27 | 26 | 24 | 23 | 22 | 20 | 19 | 18 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 8 | 7 | 6 | 5 | 0 | ||||||||
240 | 0 | SIAO | G1 | G0 | 0 | R3 | 0 | R2 | 0 | R1 | T | 0 | C | P | XWR | 0 | D | ssize | |||||||||||||||||||
8 | 24 | 8 | 2 | 2 | 1 | 3 | 1 | 3 | 1 | 3 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 6 |
Field | Width | Bits | Description | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MAP | 3 | 2:0 |
0 ⇒ invalid SDE: bits 135..72, 63..3 available for software use 2 ⇒ svaddr63..4+PTS is first level page table 7 ⇒ svaddr63..ssize are high-bits of mapping 1, 3..6 Reserved |
||||||||||||||||
2PTS | 1+PTS | 3+PTS:3 | See non-leaf PTE | ||||||||||||||||
svaddr63..4+PTS | 60−PTS | 63:4+PTS |
MAP = 2 ⇒ svaddr63..4+PTS
is first level page table MAP = 3 ⇒ svaddr63..ssize are high-bits of mapping |
||||||||||||||||
ssize | 6 | 5:0 |
Segment size is 2ssize
bytes for 12..61. Values 0..11 and 62..63 are reserved. |
||||||||||||||||
D | 1 | 6 |
Downward segment (must be 0
if ssize ≥ 48) 0 ⇒ address bits 47..ssize must be clear, i.e. = 048−ssize 1 ⇒ address bits 47..ssize must be set, i.e. = 148−ssize |
||||||||||||||||
XWR | 3 | 10:8 |
Read, Write, Execute permission:
|
||||||||||||||||
P | 1 | 11 |
Pointer permission (pointers with segment numbers are permitted) 0 ⇒ Stores of tags 0..199 to this segment take an access fault |
||||||||||||||||
C | 1 | 12 |
CHERI Capability permission 0 ⇒ Stores of tags 232 to this segment take an access fault |
||||||||||||||||
T | 1 | 15 |
0 ⇒ Memory tags give type and size 1 ⇒ Memory tags are clique |
||||||||||||||||
R1 | 3 | 18:16 | Ring bracket 1 | ||||||||||||||||
R2 | 3 | 22:20 | Ring bracket 2 | ||||||||||||||||
R3 | 3 | 26:24 | Ring bracket 3 | ||||||||||||||||
G0 | 2 | 29:28 | Generation number of this segment for GC. | ||||||||||||||||
G1 | 2 | 31:30 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this segment traps and software lowers the G1 field. This feature is turned off by setting G1 to 3. | ||||||||||||||||
SIAO | 8 | 39:32 | System Interconnect Attribute (SIA) override, addition, hints, etc. (e.g. cache bypass, as for example seen in most ISAs, such as RISC‑V’s PBMT) |
For direct mapping, the segment mapping consists of:
For segments ≤ 248 bytes, the offset is simply bits 47..0 of the local virtual address, and so the first check is that bits 47..size are zero (or all ones if downward is set in the Segment Descriptor Entry), or equivalently that svaddr47..0 < 2size. For segments > 248 bytes, the offset extends into the segment number field, and no checking need be done during mapping (such sizes are however used during checking address arithmetic), but bits 60..size must be cleared before the logical-or. The second check is that bits size−1..0 of the mapping are zero. The supervisor is responsible for providing the appropriate values in the Segment Descriptor Entries for each portion of segments > 248 bytes. Thus, paging does not need to handle segments larger than 248 bytes (the SDT for such segments is in effect the first level of the page table).
When paging is used, the page tables can be one or more levels deep. Each level has the flexibility to use a different table size, chosen by the operating system when it sets up the tables. A simple operating system might use only a single table size (e.g. 4 KiB or 16 KiB) at every level except the first (which would be a fraction of this single size). The following tables provide examples of how the local virtual address could be used to index levels of the page table for several page and segment sizes in this simple operating system. This is not the recommended way to use SecureRISC’s capabilities, but more of the backward-compatible option. In the figures below, the 13‑bit segment number is split into a 3‑bit segment group (SG) number (used to pick the SDTP register) and the offset (SEG) within that group.
63 | 61 | 60 | 48 | 47 | 21 | 20 | 12 | 11 | 0 | |||||
SG | SEG | 0 | V1 | offset | ||||||||||
3 | 13 | 27 | 9 | 12 |
63 | 61 | 60 | 48 | 47 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | ||||||
SG | SEG | 0 | V1 | V2 | offset | ||||||||||||
3 | 13 | 18 | 9 | 9 | 12 |
63 | 61 | 60 | 48 | 47 | 39 | 38 | 30 | 29 | 21 | 20 | 12 | 11 | 0 | |||||||
SG | SEG | V1 | V2 | V3 | V4 | offset | ||||||||||||||
3 | 13 | 9 | 9 | 9 | 9 | 12 |
63 | 61 | 60 | 48 | 47 | 25 | 24 | 14 | 13 | 0 | |||||
SG | SEG | 0 | V1 | offset | ||||||||||
3 | 13 | 23 | 11 | 14 |
63 | 61 | 60 | 48 | 47 | 46 | 36 | 35 | 25 | 24 | 14 | 13 | 0 | ||||||
SG | SEG | 0 | V2 | V3 | V4 | offset | ||||||||||||
3 | 13 | 1 | 11 | 11 | 11 | 14 |
At the other end of the spectrum, an operating system that is capable of
allocating any power of two size for page tables, and which did not want
to demand page the page tables, might use a single table of
2ssize-14 16 KiB PTEs for most small
segments.
If the segment size is large enough that TLB miss rates are high, the
operating system might allocate the segment’s pages in units of
64 KiB or 256 KiB and use the NAPOT encoding to take advantage
of translation caches that can match sizes greater than 16 KiB.
The follow examples illustrate how SecureRISC’s architecture
might be used by such an operating system. Once again, in the figures
below, the 13‑bit segment number is split into a 3‑bit
segment group (SG) number (used to pick the SDTP register) and the
offset (SEG) within that group.
63 | 61 | 60 | 48 | 47 | 33 | 32 | 18 | 17 | 0 | |||||
SG | SEG | V1 | V2 | offset | ||||||||||
3 | 13 | 15 | 15 | 18 |
The format of a segment page table is multiple levels, with all but the last level consisting of 8 B‑aligned 72‑bit words with integer tags in the following format:
71 | 64 | 63 | 3 | 2 | 0 | ||||||
240 | svaddr63..4+PTS | 2PTS | XWR | ||||||||
8 | 60−PTS | 1+PTS | 3 |
Field(s) | Width | Bits | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XWR | 3 | 2:0 |
0 ⇒ invalid PTE: bits 63..3 available for software 2 ⇒ non-leaf PTE (this figure) 6 Reserved 1, 3..5, 7 indicate a Leaf PTE (see below) |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2PTS | 1+PTS | 3+PTS:3 |
Table size of next level is 21+PTS entries
(24+PTS bytes):
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
svaddr63..4+PTS | 60−PTS | 63:4+PTS | Pointer to the next level of table |
The last level (leaf) Page Table Entry (PTE) is a 72‑bit word with an integer tag in the following format:
71 | 64 | 63 | 11 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 0 | |||||||
240 | svaddr63..12+S | 2S | RSW | D | A | GC | 0 | XWR | |||||||||||
8 | 52−S | 1+S | 3 | 1 | 1 | 2 | 1 | 3 |
Segments are meant as the unit of access control, but including Read, Write, and Execute permissions in the PTE might make ports of less aware operating systems easier. If RWX permissions are not needed in PTEs for operating system ports, then this field could be reduced to just a variable 1-2 bits (one bit for leaf/non-leaf, and a Valid bit only for in leaf PTEs), giving two bits back for another purpose. The most likely use of such a change would be to add two bits to System Virtual Addresses.
Field | Width | Bits | Description | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XWR | 3 | 2:0 |
Read, Write, Execute permission (additional restriction on segment
permissions):
|
||||||||||||||||
GC | 2 | 5:4 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this page traps and software lowers the G field. This feature is turned off by setting G to 3. | ||||||||||||||||
A | 1 | 6 |
Accessed: 0 ⇒ trap on any access (software sets A to continue) 1 ⇒ access allowed |
||||||||||||||||
D | 1 | 7 |
Dirty: 0 ⇒ trap on any write (software sets D to continue) 1 ⇒ writes allowed |
||||||||||||||||
RSW | 3 | 10:8 | For software use | ||||||||||||||||
2S | 1+S | 11+S:11 | This encodes the page size as the number of 0 bits followed by a 1 bit. If bit 11 is 1, then there are zero 0 bits, and S=0, which represents a page size of 212 bytes (4 KiB). | ||||||||||||||||
svaddr63..12+S | 52−S | 63:12+S |
For last level of page table, this is the translation For earlier levels, this is the pointer to the next level |
As example of the NAPOT 0S encoding, the following examples illustrate three page sizes:
71 | 64 | 63 | 12 | 11 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 0 | ||||
240 | svaddr63..12 | 1 | RSW | D | A | GC | 0 | XWR | |||||||||
8 | 52 | 1 | 3 | 1 | 1 | 2 | 1 | 3 |
71 | 64 | 63 | 14 | 13 | 12 | 11 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 0 | ||||
240 | svaddr63..14 | 1 | 02 | RSW | D | A | GC | 0 | XWR | ||||||||||
8 | 50 | 1 | 2 | 3 | 1 | 1 | 2 | 1 | 3 |
71 | 64 | 63 | 18 | 17 | 16 | 11 | 10 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 0 | |||||
240 | svaddr63..18 | 1 | 06 | RSW | D | A | GC | 0 | XWR | |||||||||||
8 | 46 | 1 | 6 | 3 | 1 | 1 | 2 | 1 | 3 |
SecureRISC‛s System Interconnect Address Attributes (SIAA) are inspired by RISC‑V’s Physical Memory Attributes (PMA). SIAAs are specified on Naturally Aligned Powers of Two (NAPOT) siaddrs. The first attribute is the memory type, described below. Attributes are further distinguished for some memory types based on the cacheability software chooses for a portion of the NAPOT address space. Cacheability options are instruction and data caching with a specified coherence protocol, instruction and data caching without coherence, instruction caching only, and uncached. The set of coherency protocols to be enumerated is TBD, but is likely to include at least MESI and MOESI. Uncached instruction accesses may require full cache block transfers on some processors to keep things simpler, and the cache block transfer used multiple times before being discarded on a reference to another cache block (so there is a limited amount of caching even for uncached instruction accesses).
The attributes are organized into the following categories:
Category | Applicability | |||
---|---|---|---|---|
Void | ROM | Main | I/O | |
Memory type | ✔ | ✔ | ✔ | ✔ |
Dynamic configuration (e.g. hotplug) | ✔ | ✔ | ✔ | |
Non-volatile | 1 | ✔ | ✔ | |
Error correction: type (e.g. SECDED, Reed-Solomon, etc.) and granularity (e.g. 72, 144, etc. bits) |
✔ | ✔ | ✔ | |
Error reporting (how detected errors are reported) | ✔ | ✔ | ✔ | |
Mandatory Access Control Set | ✔ | ✔ | ✔ | |
Read access widths supported | ✔ | ✔ | ✔ | |
Write access widths supported | ✔ | ✔ | ||
Execute access widths supported | ✔ | ✔ | ✔ | |
Uncached Alignment | ✔ | ✔ | ✔ | |
Uncached Atomic Compare and Swap (CAS) widths | ✔ | ✔ | ||
Uncached Atomic AND/OR widths | ✔ | ✔ | ||
Uncached Atomic ADD widths | ✔ | ✔ | ||
Coherence Protocols (e.g. uncached, cached without coherence, cached coherent (one of MESI, MOESI), directory-based coherence type) | ✔ | ? | ||
NUMA location (for computing distances) | ✔ | ✔ | ✔ | |
Read idempotency | 1 | 1 | ✔ | |
Write idempotency | 1 | ✔ |
Memory type is one of four values:
Width | Tag | Align | Comment | |
---|---|---|---|---|
UT | T | |||
8 | ✔ | TI | any | LX8*, LS8*, SX8*, SS8*, etc. |
16 | ✔ | TI | 0..62 mod 64 | Crossing cache block boundary not supported |
32 | ✔ | TI | 0..60 mod 64 | Crossing cache block boundary not supported |
64 | ✔ | TI | 0..56 mod 64 | Crossing cache block boundary not supported |
72 | ✔ | 0 mod 8 | Uncached LA*, LX*, LS*, SA*, SX*, SS*, etc. | |
128 | ✔ | TI | 0..48 mod 64 | Crossing cache block boundary not supported |
144 | ✔ | 0 mod 16 | Uncached LAD*, LXD*, LSD*, SAD*, SXD*, SSD*, etc. | |
256 | ✔ | TI | 0..32 mod 64 | Uncached vector load/store |
288 | ✔ | TI | 0 mod 32 | Uncached vector load/store |
512 | ✔ | TI | 0 mod 64 | Uncached vector load/store, cached untagged refill and writeback |
576 | ✔ | 0 mod 64 | Uncached vector load/store, cached tagged refill and writeback | |
768 | ✔ | 0 mod 64 | Cached tagged refill and writeback with encryption |
In the table above, the UT column indicates untagged memory support, the T column indicates tagged memory support, and the TI entry in the tagged column indicaes Tagged Immediate, defined on tagged memory where the word contains a tag in the range 240..255. Untagged memory supplies a 240 tag to the system interconnect on a read, and requires a 240 tag from the system interconnect on writes. Tagged writes (cached or uncached) to untagged memory siaddrs fail if the tag is not 240. Main memory and ROMs may impose additional uncached alignment requirements (e.g. Naturally Aligned Power Of Two (NAPOT) rather than arbitrary alignment within cache blocks).
Main memory must support reads and writes. ROMs only support reads. I/O memory may support reads, writes, or both, and may be idempotent or non-idempotent.
Attribute | Width | ||
---|---|---|---|
512 | 576 | 768 | |
Read | ☐ | ☐ | ☐ |
Write | ☐ | ☐ | ☐ |
Execute | ☐ | ☐ | |
Coherence protocols | TBD |
Attribute | Width | ||
---|---|---|---|
512 | 576 | 768 | |
Read | ☐ | ☐ | ☐ |
Write | n.a. | ||
Execute | ☐ | ☐ | |
Coherence protocols | n.a. |
Attribute | Width | ||
---|---|---|---|
512 | 576 | 768 | |
Read | TBD | ||
Write | |||
Execute | |||
Coherence protocols |
Attribute | Width | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 16 | 32 | 64 | 72 | 128 | 144 | 256 | 288 | 512 | 576 | 768 | |
Read | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Write | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Execute | 0 | ☐ | 0 | ☐ | 0 | ☐ | 0 | ☐ | ☐ | |||
Atomic CAS | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Atomic AND/OR | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Atomic ADD | ☐ | ☐ | ☐ | ☐ | 0 | ☐ | 0 | |||||
Coherence protocols | n.a. | |||||||||||
Read Idempotency | 1 | |||||||||||
Write Idempotency | 1 |
Attribute | Width | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 16 | 32 | 64 | 72 | 128 | 144 | 256 | 288 | 512 | 576 | 768 | |
Read | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Write | 0 | |||||||||||
Execute | 0 | ☐ | 0 | ☐ | 0 | ☐ | 0 | ☐ | ☐ | |||
Atomic CAS | 0 | |||||||||||
Atomic AND/OR | 0 | |||||||||||
Atomic ADD | 0 | |||||||||||
Coherence protocols | n.a. | |||||||||||
Read Idempotency | 1 | |||||||||||
Write Idempotency | n.a. |
Attribute | Width | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 16 | 32 | 64 | 72 | 128 | 144 | 256 | 288 | 512 | 576 | 768 | |
Read | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Write | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Execute | 0 | ☐ | 0 | ☐ | 0 | ☐ | 0 | ☐ | ☐ | |||
Atomic CAS | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Atomic AND/OR | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ | ☐ |
Atomic ADD | ☐ | ☐ | ☐ | ☐ | 0 | ☐ | 0 | |||||
Coherence protocols | n.a. | |||||||||||
Read Idempotency | ☐ | |||||||||||
Write Idempotency | ☐ |
Tagged memory is an attribute derived from the above. Tagged is true for ROM and main memory that supports uncached 72‑bit reads or cached 576‑bit or 768‑bit (for authentication and optional encryption) reads and optionally writes. Untagged memory supports some subset of uncached 8‑bit, …, 64‑bit, 128‑bit reads and optionally writes, or cached 512‑bit reads and optionally writes, and supplies a 240 tag on read, and accepts a 240 or 241 tag on writes. Code ROM (e.g. the boot ROM) might support only tags 241 and 252.
Encryptable is an attribute derived from the above. Encryptable is true for ROM and main memory that supports cached 768‑bit (for authentication and optional encryption) reads and optionally writes.
CHERI capable is an attribute derived from the above. CHERI capable is true for tagged main memory that supports tags 240, 232, and 251. This could be a cacheable 512‑bit that synthesizes tags on read from a a in-DRAM tag table with cache and compression.
After 64‑bit Local Virtual Addresses (lvaddrs) are mapped to 64‑bit System Virtual Addresses (svaddrs), these 64‑bit svaddrs are mapped to 64‑bit System Interconnect Addresses (siaddrs). This mapping is similar, but not identical to the mapping above as it starts with a 16‑bit region number rather than one of eight 13‑bit segment numbers. There is one such mapping set by the hypervisor for the entire system using a Region Descriptor Table (RDT) at a fixed system address. The RDT may be hardwired, or read-only, on read/write by the hypervisor. For the maximum 65,536 regions, with 16 bytes for a RDT entry, the maximum size RDT is 1 MiB in size. A system configuration parameter allows the size of the RDT to be reduced when the full number of regions is not required (which is likely).
The format of the Region Descriptor Entries is shown below. It is similar to Segment Descriptor Entries, but without the D, X, P, C, R1, R2, R3, G0, G1, and SIAO fields, and with the addition of the MAC, and ATTR fields.
A possible future addition would be a permission bit that prohibits execution from privileged rings. Alternatively, there could be a mandatory access bit required in MAC for this.
71 | 64 | 63 | 3 | 2 | 0 | ||||||
240 | siaddr63..4+PTS | 2PTS | MAP | ||||||||
8 | 60−PTS | 1+PTS | 3 |
71 | 64 | 63 | 32 | 31 | 28 | 27 | 16 | 15 | 14 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 0 | ||||||
240 | ATTR | ENC | MAC | 0 | RPT | 0 | WR | 0 | rsize | |||||||||||||||
8 | 32 | 4 | 12 | 1 | 3 | 2 | 2 | 2 | 6 |
Field | Width | Bits | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MAP | 3 | 2:0 |
0 ⇒ invalid RDE: bits 135..72, 63..3 available for software use 2 ⇒ siaddr63..4+PTS is first level page table 3 ⇒ siaddr63..rsize are high-bits of mapping 1, 4..7 Reserved |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2PTS | 1+PTS | 3+PTS:3 |
Table size of next level is 21+PTS entries
(24+PTS bytes):
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
siaddr63..4+PTS | 60−PTS | 63:4+PTS |
MAP = 2 ⇒ siaddr63..4+PTS
is first level page table MAP = 3 ⇒ siaddr63..rsize are high-bits of mapping |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
rsize | 6 | 5:0 |
Region size is 2rsize
bytes for 12..61. Values 0..11 and 62..63 are reserved. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
WR | 2 | 9:8 | Write Read permission | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RPT | 3 | 14:12 |
Region Protection Table ring Accesses by rings less than or equal to this value apply permissions specified by rptp. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MAC | 12 | 27:16 | Mandatory Access Set | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ENC | 4 | 31:28 |
Encryption index 0 ⇒ no encryption 1..15 ⇒ index into table giving algorithm and 256‑bit key |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ATTR | 32 | 63:32 | Physical Memory Attributes |
The format of a region page table is multiple levels, each level consisting of 72‑bit words with integer tags in the same format as PTEs for local virtual to system virtual mapping, except there is no X or G fields.
71 | 64 | 63 | 3 | 2 | 0 | ||||||
240 | siaddr63..4+PTS | 2PTS | XWR | ||||||||
8 | 60−PTS | 1+PTS | 3 |
Field | Width | Bits | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XWR | 3 | 2:0 |
0 ⇒ invalid PTE: bits 63..3 available for software 2 ⇒ non-leaf PTE (this figure) 1, 3 indicate valid Second-Level Leaf PTE (see below) 4..7 Reserved |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2PTS | 1+PTS | 3+PTS:3 |
Table size of next level is 21+PTS entries
(24+PTS bytes):
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
siaddr63..4+PTS | 60−PTS | 63:4+PTS | Pointer to the next level of table |
The Second-Level Leaf Page Table Entry (PTE) is a 72‑bit word with an integer tag in the following format:
71 | 64 | 63 | 11 | 10 | 8 | 7 | 6 | 5 | 3 | 2 | 0 | ||||||||
240 | siaddr63..12+S | 2S | RSW | D | A | 0 | XWR | ||||||||||||
8 | 52−S | 1+S | 3 | 1 | 1 | 3 | 3 |
Field | Width | Bits | Description | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XWR | 3 | 2:0 |
Read, Write permission:
|
||||||||||
A | 1 | 6 |
Accessed: 0 ⇒ trap on any access (software sets A to continue) 1 ⇒ access allowed |
||||||||||
D | 1 | 7 |
Dirty: 0 ⇒ trap on any write (software sets D to continue) 1 ⇒ writes allowed |
||||||||||
RSW | 3 | 10:8 | For software use | ||||||||||
2S | 1+S | 11+S:11 | This encodes the page size as the number of 0 bits followed by a 1 bit. If bit 11 is 1, then there are zero 0 bits, and S=0, which represents a page size of 212 bytes (4 KiB). | ||||||||||
siaddr63..12+S | 52−S | 63:12+S |
For last level of page table, this is the translation For earlier levels, this is the pointer to the next level |
Reading Segment Descriptor Entries (SDEs) from the Segment Descriptor Table (SDT) and Region Descriptor Entries (RDEs) from the Region Descriptor Table (RDT) would typically be done through the L2 Data Cache. Since the L2 Data Cache is coherent with respect to this and other processors in the system, it is possible that L2 Data Cache might note that the TLB contains entries from the line and send an invalidate to the TLB when the L2 line is invalidated. This might avoid the need for some TLB flushes. However, this requires the L2 to store the TLB location, which might require 8 bits per L2 tag. It is unclear whether this is worthwhile.
The hypervisor has a unified 64‑bit address space of System Virtual Addresses (svaddrs) divided into 65536 regions. (The purpose of the unified address space simplifies the communication between supervisors and I/O devices.) The hypervisor allocates regions to supervisors, for example using the region descriptors to only allocate the appropriate portion of memory and I/O spaces to them. In a unified address space, each supervisor is capable of attempting references to the addresses of other supervisors or even to hypervisor addresses. Only the region protection mechanism prevents such access attempts from succeeding. The first level of protection is simple Read and Write permissions that the hypervisor sets for each region and supervisor. This is implemented as a 65536-entry table, one for each region, of 2‑bit values (16 KiB):
Value | Permission |
---|---|
0 | None |
1 | Read-only |
2 | Reserved |
3 | Read and write |
At this time, I don’t see the need to add per region read and write ring brackets to these permissions. The unified region descriptor table does specify the rings that employ these permissions, which allows the hypervisor access to its own regions on any entry to hypervisor code.
For SecureRISC processors, the hypervisor specifies region permissions by storing a siaddr to the table in the rptp CSR. This would typically be context switched by the hypervisor. While most supervisors would have a unique rptp value, in theory a single protection domain could be shared by a cooperating group of supervisors. Region protection is cached in translation caches along with other permissions and the lvaddr→siaddr mapping. The PDID field exists to allow cached values to be differentiated when the rptp value or its contents changes in a fashion similar to the ASID field of sdtp registers.
71 | 64 | 63 | 11 | 10 | 0 | ||||||
240 | siaddr63..12+PDS | 2PDS | PDID | ||||||||
8 | 52−PDS | 1+PDS | 11 |
Field | Width | Bits | Description |
---|---|---|---|
PDID | 12 | 10:0 | Protection Domain ID |
2PDS | 1+PDS | 11+PDS:11 | Encoding of Region Protection Table Size |
siaddr63..12+PDS | 52−PDS | 63:12+PDS | Pointer to Region Protection Table |
Because translation cache misses in many microarchitectures will access the Region Protection Table through the L2 data cache, the hypervisor may find it benefits performance to allocation regions to supervisors in a clustered fashion, so that a single L2 data cache line serves all RPT accesses during a supervisor’s quantum.
Non-processor initiating ports into the system interconnect (Initiators) are limited in which regions they are permitted to access. One option is similar functionality to the processor mechanism described above. For example, a port might have a configuration register equivalent to rptp that the hypervisor can set, or something simpler or more complex depending the complexity of the entities using the port for region access.
Another possibility is that each Initiator is programmed by the hypervisor with two or more Mandatory Access Control (MAC) sets. One is for the Initiator’s TLB accesses, and the others are for accesses made by agents that the Initiator services. The MAC set for a region is stored as part of the Region Descriptors and cached in the Initiator’s TLB. The Initiator tests each access and rejects those that fail. Read access requires RegionCategories ⊆ InitiatorCategories and Write access requires RegionCategories = InitiatorCategories. For example, the Region Descriptor Table and the page tables those reference might have a Hypervisor bit that would prevent reads and writes from anything but Initiator TLBs. Processors would have Mandatory Access Control sets per ring. This would allow the same system to support multiple classification levels, e.g. Orange Book Secret and Top-Secret, with Top-Secret peripherals able to read both Secret and Top-Secret memory, but Secret peripherals denied access to Top-Secret memory.
Encryption might also be used to protect multiple levels of data in a system. For example, if Secret and Top-Secret data in memory are encrypted with different keys, and Secret Initiators are only programmed with that encryption key, then reading Top-Secret memory will result in garbage being read and writing Top-Secret data from a peripheral to Secret memory will result in that data being garbage to a processor or another peripheral with only the Secret key.
Because encryption results only in data being unintelligible, it is more difficult to debug. It may be desirable to employ both MAC sets and encryption.
An optional system feature of Region Descriptor Entries (RDEs) is to specify that the contents of the memory of the region should be protected by authenticated encryption on a cache line basis. If the keys are sufficiently protected, e.g. in a secure enclave, the data may be protected even when system software is compromised. A separate table in the secure enclave gives the symmetric encryption key for encrypting and decrypting data transferred to and from the region and the system interconnect address would be used as the tweak. The challenge of cache line encryption, with only 64 bytes of data, is providing sufficient security with a smaller storage overhead than is typical for larger data blocks, while keeping the added latency of a cache miss minimal.
Cache lines are 576 bits. To encrypt, use a standard 128‑bit block cipher (e.g. standard AES128) five times in counter mode using 128 bits of the key to xor, producing ciphertext. Append a 64‑bit authentication code and the 64‑bit nonce used for encryption and authentication yielding 704 bits. The authentication code is a hash of the 576‑bit ciphertext added to the other 128 bits of the key applied to a different counter value. Add 8 ECC bits to each 88 bits produces a memory line of 768 bits. Main memory might then be implemented with three standard 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 counter mode xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would be much less time than the ECC check. Writes would incur the counter mode computation latency (primarily six AES computations). Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to support non-block writes (it being impossible to update the authentication code when without doing a read, modify, write).
Encryption would not be supported for untagged memory, as the purpose of untagged memory is primarily for I/O devices. Were encryption to be supported it would have to be with a tweakable block cipher (e.g. XTS-AES), because such memory would not support the extra bits required for tags and authentication.
In particular the encryption of the 576‑bit (ignoring ECC) cache line CL to a 768‑bit memory line ML (including ECC) using cache line address siaddr63..6 and the 64‑bit internal state nextnonce would be as follows:
nonce ← nextnonce
nextnonce ← (nextnonce ⊗ 𝑥) mod 𝑥64+𝑥4+𝑥3+𝑥+1
T0 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000000)
T1 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000001)
T2 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000010)
T3 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000100)
T4 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥001000)
T5 ← AESenc(Key255..128, nonce63..0∥siaddr63..6∥100000)63..0
C ← CL ⊕ (T4∥T3∥T2∥T1∥T0)
A0 ← C63..0 ⊗ K0 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A1 ← C127..64 ⊗ K1 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A2 ← C191..128 ⊗ K2 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A3 ← C255..192 ⊗ K3 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A4 ← C319..256 ⊗ K4 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A5 ← C383..320 ⊗ K5 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A6 ← C447..384 ⊗ K6 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A7 ← C511..448 ⊗ K7 mod 𝑥64+𝑥4+𝑥3+𝑥+1
A8 ← C575..512 ⊗ K8 mod 𝑥64+𝑥4+𝑥3+𝑥+1
AC ← A8⊕A7⊕A6⊕A5⊕A4⊕A3⊕A2⊕A1⊕A0⊕T5
AE ← C ∥ nonce63..0 ∥ AC
M0 ← ECC(AE87..0) ∥ AE87..0
M1 ← ECC(AE175..88) ∥ AE175..88
M2 ← ECC(AE263..176) ∥ AE263..176
M3 ← ECC(AE351..264) ∥ AE351..264
M4 ← ECC(AE439..352) ∥ AE439..352
M5 ← ECC(AE527..440) ∥ AE527..440
M6 ← ECC(AE615..528) ∥ AE615..528
M7 ← ECC(AE703..616) ∥ AE703..616
ML ← M7 ∥ M6 ∥ M5 ∥ M4 ∥ M3 ∥ M2 ∥ M1 ∥ M0
The inverse of the above for decrypting ML to CL and checking the authentication is obvious. A variant where a 144‑bit block cipher with a 144‑bit key (e.g. AES with 9‑bit S-boxes or an obvious Simon144/144) is used instead of 128‑bit AES is fairly obvious, and might make sense for datapath width matching, but the nonce and authentication would remain 64 bits to fit the result in 768, which probably makes datapath matching consideration secondary, and the extra key width is a slight annoyance (but see PQC note below where a 144‑bit key might be an advantage).
It is not yet determined whether K0, K1, …, K8 are constants or generated from the 256‑bit key via a key schedule algorithm, or simply provided by the software.
The table in the secure enclave specifies up to 15 algorithm and key pairs (where typically a key is an encryption key and authentication key pair).
Value | Encryption | Auth | Extra bits | What |
---|---|---|---|---|
0 | None | None | 0 | no encryption, no authentication |
1 | None | CWMAC | 64+64 |
No encryption authentication using 64‑bit Carter-Wegman with 64‑bit nonce and Key255..128 |
2 | AES128 | CWMAC | 64+64 |
AES128 CTR encryption with 64‑bit nonce, 64‑bit tweak 64‑bit Carter-Wegman Authentication code Key127..0 used for AES128 CTR, Key255..128 used for authentication |
3..15 | Reserved |
It is possible that Simon128/128 could be used in place of AES128 to reduce the amount of area required. The area of 16 S-boxes for one round AES being somewhat expensive, and six iterations of 14 rounds is too slow, so perhaps 96 S-boxes are required to keep the write latency latency reasonable (the read latency being covered by the DRAM access time with this many S-boxes).
Post-Quantum Cryptography (PQC) may require a 192‑bit or 256‑bit key due to Grover’s Algorithm. AES192 and AES256 however require 12 and 14 rounds respectively (20% and 40% more than AES128), which may add too much latency to cache block write-back, which is already somewhat affected by 10 rounds of AES128, which are each relatively slow due the S-box complexity. It is possible that Simon128/192 or Simon128/256 become better choices at larger key sizes, as 192‑bit keys are only 1.5% and 5.9% additional rounds. On the other hand, it is also possible to use additional S-boxes for parallel AES computation. AES S-boxes are somewhat costly, which argues against this, but in counter-mode encryption inverse S-boxes are not required, so perhaps this is acceptable. For example, by using 32 S-boxes, the computations specified above allow for producing two of the six computations in parallel, with the S-boxes being used only three times rather than six. It would be nice to have cryptographers weigh in some of these issues (this author is definitely not qualified).
Given the 8×88=704 bits to be protected with 64 bits of ECC, which can detect up to 16 bits of errors and correct up to 8, it might be interesting to consider Reed-Soloman error correction and detection for block of 88 8‑bit codewords with eight check symbols, which would be able to detect up to 8 symbol errors (32 to 64 bits) and correct up to 4 symbol errors (16 to 32 bits). However, the latency of detection and ECC generation for cache fills becomes an issue.
SecureRISC processors have three levels of Reset and one Non-maskable Interrupt (NMI):
Power-on Reset is required when power is first applied to the processor, and may require thousands of cycles, during which time various non-digital circuits may be brought into synchronization. After this initialization, Power-on Reset is identical to Hard Reset. Hard Reset forces the processor to reset even when there are outstanding operations in process, e.g. queued reads and writes, and will require system logic to be similarly reset to maintain synchronization. Power-on Reset and Hard Reset begin execution at the same hardwired ROM address. Soft Reset simply forces the processor to begin executing at the separate Soft Reset ROM address, while maintaining its existing interface to the system interface (e.g. queued reads and writes). Soft Reset may be used to restart a processor that has entered the Halt state. Finally Non-Maskable Interrupts cause an interrupt to a ring 7 address for ultra-timing-critical events. NMIs are initiated with an edge-triggered signal and should not be repeated while an earlier NMI is being handled. Timing-critical events that can be delayed during other interrupt processing should use normal message interrupts, to be serviced at their specified interrupt priority.
Power-on Reset and Hard Reset begin with the vmbypass, icachebypass, and dcachebypass bits set. The first forces an lvaddr→siaddr identity mapping. This allows the hardwired reset PC to be fetched from a system ROM and initialize the rest of the processor state, including the lvaddr→svaddr and svaddr→siaddr translation tables. At this point it should clear the vmbypass bit. vmbypass cannot be reenabled once clear, and thus is available only to the Boot ROM.
The Boot ROM is expected to initialize the various instruction fetch caches and then clear the icachebypass bit. Once clear, this bit may not be reenabled except by Power-on or Hard Reset. Next the Boot ROM is expected to initialize the various data caches and clear the dcachebypass bit. This bit also may not be reenabled except by Power-on or Hard Reset. Finally the Boot ROM is then responsible for starting the Root of Trust verification process and once that is complete, transferring to the hypervisor.
SecureRISC processors reset in an implementation-specific manner. During all three resets, the architecture requires some processor state to be set to specific values, and other state is left undefined and must be initialized by the boot ROM. In particular the following is required:
State | Initial value | Comment |
---|---|---|
PC | 0xC6FFFFFFFFFF000000 |
Basic block descriptor pointer, ring 7, 4 MiB from end of address space |
InterruptEnable[7] | 0 | All ring 7 interrupts disabled. |
vmbypass | 1 | Force lvaddr→siaddr identity map. |
icachebypass | 1 | Bypass all instruction fetch caching. |
dcachebypass | 1 | Bypass all data cache caching. |
I expect both moderately speculative (e.g. 3-4 wide in-order with branch prediction) and highly speculative (e.g. 4-12 wide Out-of-Order (OoO)) implementations of SecureRISC to be appropriate, albeit with the highly-speculative implementations having solutions for Meltdown, SPECTRE, Foreshadow, Downfall, Inception, etc. and similar attacks that result from speculation. The moderately speculative processors are likely to be less vulnerable to future attacks, and the ISA should strive to enable such processors to still perform well (i.e. not depend upon OoO for reasonable performance, only for the highest performance). This is one reason I prefer the AR/XR/SR/BR/VR model (inspired by the ZS-1), where operations on the ARs/XRs may get ahead of operations on the SRs/BRs/VRs, and end up generating pipelined cache misses on SR/VR load/store without stalling, thus being more latency tolerant. This is likely to work well for floating-point values, which naturally will be allocated to the SRs/VRs, but will depend on the compiler to put non-address generation integer arithmetic in the SRs/VRs. It may be that some microarchitectures will choose to handle SR load/store from the L2 data cache due to this latency tolerance, and the SR execution units will end up operating by at least the L2 data cache latency after the AR/XR execution units, causing branch mispredicts on BRs to have additional penalty, and for moves from SRs back to ARs to be costly, but this is better than penalizing every SR load miss.
An OoO implementation might choose to rename the AR/XR/SR registers to a unified physical register file but doing so would give up the reduced number of register file read and write ports that separating these files offers. I expect the preferred implementation will rename each to their own physical files.
The following example goes for full OoO (rather than
the moderately speculative
possibility mentioned above) but
exploits the AR/XR
vs. SR separation by
targeting SR/VR/VM load/store to the L2
data cache. The L1 data cache exists for address generation
acceleration.
The challenge with highly speculative microarchitectures is avoiding vulnerabilities such as Spectre, Meltdown, RIDL, Foreshadow, Inception, etc. One possibility under consideration (not detailed in the table below) is to have all caches (including translation and control-flow caches) have a per-index way dedicated to speculative fills, and when the fill becomes not speculative, then you designate a different way as the speculative fill way for that index. Speculation pipeline flushes have to then kill the speculative fills, which is likely to reduce performance, so it might be necessary to introduce a per-process option. It is also a potential performance issue that there would only be one speculative fill way per index. It is the control-flow caches that are the most problematic because they usually have only probabilistic matching, but Inception shows that there is a potential hole here.
Another general consideration when employing speculative execution is to carefully separate privilege levels in microarchitectural state. For example, low-privilege execution should not be able to affect the prediction (branch, cache, prefetch, etc.) of higher privilege levels, or different processes at the same privilege level. Flushing microarchitectural state would be sufficient, but would unacceptably affect performance, so where possible, privilege level and process identifiers should be included in the matching used in microarchitectual optimizations (e.g. prediction). For example, the Next Descriptor Index and Return Address predictors suggested below include the ring number in its tag to prevent one class of attacks. For bypassing based upon siaddrs, a ring may be included; if the ring of the data is greater than the execution ring, this should force a fence. This does not address an attack from one process to another at the same privilege level, which would require inclusion of some sort of process id, which might be too expensive.
Note: Size in Kib (1024 bits) and Mib (1048576 bits) below do not
include parity, ECC, column, or row redundancy. A +
is appended
to indicate there may be additional SRAM bits.
Structure | Description | |
---|---|---|
Basic Block Descriptor Fetch | ||
Predicted PC | 62‑bit lvaddr and 3‑bit ring | |
Predicted loop iteration |
64‑bit count (initially from prediction, later from LOOPX) 64‑bit iteration (no loop back when iteration = count) 1‑bit Boolean whether LOOPX value received 64‑bit BB descriptor address with c set that started the loop |
|
Predicted CSP |
8×(61+3)‑bit 61‑bit lvaddr63..3 and 3‑bit ring |
|
Predicted Basic Block Count | 8‑bit bbc | |
Predicted Basic Block History |
128‑entry circular buffer indexed by bbc6..0 (see below) ~9 Kib, not including register rename snapshot |
|
L1 BB Descriptor TLB |
256 entry, 8‑way set associative,
640‑bit line (8 SDE/PTEs), mapping lvaddr63..12 to siaddr63..12 in parallel with BB Descriptor Cache, line index: lvaddr14..12, set index: lvaddr16..15, tag: lvaddr63..17, data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Descriptor/Instruction TLB with 640‑bit read 20+ Kib data, 1.5+ Kib tag |
|
L2 BB Descriptor TLB |
2048 entry, 8‑way set associative,
640‑bit line (8 SDE/PTEs), line index: lvaddr14..12, set index: lvaddr19..15, tag: lvaddr63..20, data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Data Cache with 512‑bit read and augmented with SDE bits 160+ Kib data, 5.6+ Kib tag |
|
BB Descriptor Cache |
4096 descriptors (65 bits each), 8‑way set associative, 520#8209;bit line size, 65‑bit read, 520‑bit tagged refill, line index: lvaddr5..3, index: lvaddr11..6, tag: siaddr63..12, 1.5 cycles latency, 2 cycles to predicted PC, filled from L2 Descriptor/Instruction Cache on miss and by prefetch. Might include some branch prediction bits that are initialized from hint bits, but then updated (whether to do this depends on the whether a separate write port is required, in which case a separate RAM is probably appropriate). For example, a simple 2‑bit counter might serve as a first stage for YAGS or TAGE. 260+ Kib data, 26+ Kib tag |
|
Next Descriptor Index Predictor |
128×(9+10+3), direct mapped (sized to access in less than a cycle) index: lvaddr9..3, tag: lvaddr19..10 + 3‑bit ring, data: lvaddr11..3, 1 cycle to predicted BB Descriptor Cache index, This predictor is accessed in parallel with the BB Descriptor Cache (BBDC). It contains the most recent flow change hits from the BBDC and is used to accelerate fetch of next BB Descriptor by starting a new BBDC read 1 cycle after the last. If 2-cycle BBDC access and prediction yields the same index, then the read of the target BB Descriptor is accelerated by one cycle. If predicted next index differs then BBDC value fetched early is discarded. The ring is included in the tag, and the data is only used if PC.ring ≤ tag.ring. 2.75+ Kib |
|
Return Address Prediction |
The committed version of return addresses are stored on per-ring
call stacks in memory. This structure maintains speculative
versions of those lines for the BB
Descriptor next field
types Call, Conditional Call, Call Indirect,
and Conditional Call Indirect.
Exceptions also speculatively update this structure. Attempts to
write a line not in this structure fetch the line from memory
unless CSP[PC.ring]5..3 = 0,
since in that case the call stack is initializing a new line. Lines from this structure are never written back to memory. This structure is read on the BB Descriptor next field type Return and Exception Return to predict the target PC. Unlike other microarchitecture Return Address Stacks, this structure is line-oriented, tagged, and searched by the predicted CSP[PC.ring], and may be filled from a line at a time from memory with non-speculative values as needed (and thus more likely to predict successfully compared to the typical wrapping Return Address Stack or after a context switch that changes CSP[PC.ring]). It is 8 entries and fully associative to handle cross-ring call and return gracefully. index: lvaddr5..3, tag: ASID ∥ lvaddr63..6 + 3‑bit ring An entry matches only if PC.ring = tag.ring. 4.5+ Kib data, 464+ bits tag |
|
Branch Predictor |
~16 KiB BATAGE Whisper add-on? Consider using YAGS with 8192 entries of 2‑bit saturating counters in the choice table, and 1024 entries of 2‑bit saturating counters with 8‑bit tags for the T and NT tables (total 36,864 bits) as a replacement for the first two TAGE stages. ~128 Kib |
|
Indirect Jump/Call Predictor | ~16 KiB ITTAGE? | |
Loop Count Predictor |
Predict loop count after fetching BB descriptor
with c set. TAGE-like, based on history, no hit is equivalent to 216−1 first-level (no history): 128 entry, 2‑way set associative index: lvaddr8..3 of BB descriptor with c set tag: lvaddr16..9 + 3‑bit ring data: 16‑bit count (0..65535) Prediction used only if PC.ring ≤ tag.ring. Written only on mispredicts that occur prior to LOOPX value received. 2+ Kib first-level data, 1+ Kib tag (other levels TBD) |
|
BB Fetch Output | 8‑entry BB Descriptor Queue of PC, BB type, fetch count, fetch siaddr63..2, instruction start mask, branch and jump prediction to check | |
Instruction Fetch | ||
L1 Instruction Cache |
2048 entry (128 KiB),
4‑way set associative,
512‑bit line, read, write index: siaddr14..6, tag: siaddr63..15, 2-cycle latency, use 0*-2 times per basic block descriptor, so 0 or 2-3 cycles for entire BB instruction fetch, filled from L2 Descriptor/Instruction Cache on miss and prefetch, experiment with prefetch on BB descriptor fill experiment with a larger cache and 3-cycle latency 256+ Kib, 24.5+ Kib * 0 fetches required if the previous 512‑bit fetch covers the current one |
|
L2 Fetch | ||
L2 Combined Descriptor/Instruction Cache |
8192 entry (512 KiB), 8‑way set associative,
520‑bit line, read, write, index: siaddr15..6, tag: siaddr63..16, filled from system interconnect or L3 on miss and prefetch, evictions to L3 2080+ Kib data, 192+ Kib tag |
|
Instruction Fetch Output |
32‑entry Instruction Queue of 80‑bit
decoded AR/XR
instructions 32‑entry Instruction Queue of 96‑bit decoded SR/BR/VR/VM instructions (16‑bit, 32‑bit, 48‑bit, and 64‑bit instructions expanded to canonical formats) |
|
AR/XR (Early) Execution Unit |
||
PC, CSP | Committed values | |
Register renaming for ARs | 16×6 4‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical AR numbers and assigning d from AR free list. | |
Register renaming for XRs (and CSP?) | 16×6 8‑read, 4‑write register file mapping 4‑bit a, b, fields to physical XR numbers and assigning d from XR free list. | |
Register renaming for BRs | 16×6 6‑read, 2‑write register file mapping 4‑bit a, b, c fields to physical BR numbers and assigning d from BR free list. | |
Register renaming for SRs | 16×6 8‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical SR numbers and assigning d from SR free list. | |
Register renaming for CARRY |
3‑bit register for 1→8 mapping, 8‑bit bitmap of free entries for allocation |
|
(VRs/VMs are not renamed) | ||
AR physical register file | 128×144 (+ parity) 6‑read, 4‑write | |
XR physical register file | 128×72 (+ parity) 6‑read, 4‑write | |
Segment Size Cache |
for segment bounds checking: 128 entry, 4‑way set associative, parity protected, mapping lvaddr63..48 and ASID to 6‑bit segment size log2 ssize and 2‑bit G0 for eight segments (one L2 TLB line), index: lvaddr55..51 per way tag: 20 bits (12‑ASID and lvaddr63..56), per way data: 64 bits indexed by lvaddr50..48 filled from L2 Data TLB 8+ Kib data, 2.5+ Kib tag |
|
L1 Data TLB |
512 entry, 8‑way set associative,
640‑bit line (8 SDE/PTEs), mapping lvaddr63..12 to siaddr63..12 in parallel with L1 Data Cache, line index: lvaddr14..12, set index: lvaddr17..15, tag: lvaddr63..18, data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Data TLB with 640‑bit read 40+ Kib data, 1.5+ Kib tag |
|
L2 Data TLB |
2048 entry, 8‑way set associative,
640‑bit line (8 SDE/PTEs), line index: lvaddr14..12, set index: lvaddr19..15, tag: lvaddr63..20, data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Data Cache with 512‑bit read and augmented with SDE bits 160+ Kib data, 5.6+ Kib tag |
|
L1 Data Cache |
512 entry (~36 KiB), 4‑way set associative,
576‑bit line, 144‑bit read, 576‑bit refill, index: lvaddr12..6, tag: siaddr63..13, write-thru, filled from L2 Data Cache on miss or prefetch 288+ Kib data, 25.5+ Kib tag |
|
Return Address Stack Cache |
8#8209;entry, fully associative, 576‑bit line size, fill and writeback to L2 Data Cache, subset and coherent with L2 Data Cache tag: siaddr63..6 + 3‑bit ring, 4.5+ Kib data, 488+ bit tag |
|
L2 Data Cache |
32768 entry (~2.25 MiB), 8‑way set associative,
576‑bit line, read, write, index: siaddr17..6, tag: siaddr63..18 + state, write-back, used for SR/VR/VM load/store and L1 Data Cache misses, filled from system interconnect or L3 on miss or prefetch, eviction to L3 18+ Mib data, 1.5+ Mib tag |
|
L2 Data Cache Prefetch | TBD, possibly based on Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetch (16.7 KiB). | |
AR Engine Output | 64‑entry BR/SR/VR/VM operation queue | |
BR/SR/VR/VM (Late) Execution Unit (tends to run about a L2 Data Cache latency behind the AR Execution Unit) |
||
BR physical register file | 64×1 6‑read, 2‑write | |
SR physical register file | 128×72 (+ parity) 8‑read, 4‑write | |
CARRY physical register file | 8×64 (+ parity) 1‑read, 1‑write | |
VR register file |
16×72×128 (+ parity) 4‑read, 2‑write 144+ Kib |
|
VM register file |
16×128 (+ parity) 3‑read, 1‑write 2048 bits |
|
Combined Fetch/Data | ||
System virtual address TLB |
256 entry, 8‑way set associative, 640‑bit line (8 RDE/PTEs), mapping system virtual addresses to system interconnect addresses (maintained by hypervisor) line index: lvaddr14..12, set index: lvaddr16..15, filled from L2 Data Cache with 512‑bit read, expanded with RDE bits sized small because large page sizes expected 160+ Kib data, 12+ Kib tag |
|
L3 Eviction Cache serving multiple processor L2 Instruction and L2 Data caches |
262144 entries (~18 MiB), 8‑way set associative,
576‑bit line size, non-inclusive, index: siaddr20..6, tag: siaddr63..21 + state, write-back, plus 8‑way set associative directory for sub caches, filled by evictions from L2 Instruction and Data caches 144+ Mib data, 11.5+ Mib tag, 11.5+ Mib directory |
Using a line size in TLBs is unusual, but could represent a performance boost, given that the L2 data cache read is going to supply a whole line anyway. Without the line size, the L1 TLBs would only contain 32 or 64 entries for critical path reasons, and this is quite small. The issue is second level translation and svaddr protections. Performing these lookups for 8 PTEs would slow the TLB refill, so I expect the example microarchitecture to mark 7 of the 8 PTEs as requiring secondary checks and continue. On a match to an entry that requires secondary checks, these would be performed then, and the entry updated.
For tracking Basic Blocks (BBs) in the pipeline, there would be a 8‑bit internal basic block counter bbc (independent of the larger BBcount[ring] counters) incremented on each BB descriptor fetch. bbc6..0 would be used as index to write basic block information into a 128‑entry circular buffer for basic blocks in the pipeline, including the BB descriptor address, the prediction to check (including the conditional branch taken/not-taken, loop back prediction, and full target descriptor address for indirect jumps and returns), and so on. The circular buffer entry would also include a bit mask of completed instructions, and the entry may only be overwritten when all instructions are completed. Completion of all instructions of the BB causes state updates to commit (e.g. PC, CSP, and call stack writes).
Basic block ordering is tested testing the sign bit of subtraction: BBx is after BBy if (bbcx − bbcy) > 0. Each instruction in the pipeline includes its bbc value and the offset in the basic block. When a misprediction is detected, all instructions with bbc values after the basic block with the misprediction (using the above test) are flushed from the pipeline, bbc is reset to the bbc value of the mispredict plus one, and basic block descriptor fetch is restarted using the correct next descriptor (e.g. PC+8 for a not-taken conditional branch or the target calculated from the targr and targl fields, or the JMPA/LJMP/LJMPI/SWITCH destination for an indirect jump). Whether the circular buffer stores the targr/targl values or refetches them is TBD. 128 basic block predictions may seem large, but with the SecureRISC loop count prediction, 100% accuracy might be achieved, which means the 128‑entry circular buffer supports 128 loop iterations, and each loop iteration might only be three or four instructions. Note that in SecureRISC, there are 0, 1, or 2 predictions to check per basic block (e.g. a conditional branch and indirect jump, e.g. for a case dispatch), so 0, 1, or 2 mispredicts are possible (i.e. there might be two flushes).
I expect that immediately after each 512‑bit read of the L1 instruction cache, the start mask from the Basic Block (BB) descriptor will be used to feed the specified bits to eight parallel decoders which will convert them to a canonical form, something along the lines of the following. These canonicalized instructions would then be put into queues for the early pipeline (e.g. operations and branches on XRs and loads to and stores from ARs/XRs), late pipeline (SRs/BRs/VRs/VMs, or both of these (e.g. for loads to and stores from SRs/VRs/VMs and moves between early and late).
79 | 24 | 23 | 22 | 21 | 17 | 16 | 12 | 11 | 7 | 6 | 0 | |||||
i | sa | b | a | d | op80 | |||||||||||
56 | 2 | 5 | 5 | 5 | 7 |
95 | 42 | 41 | 38 | 37 | 32 | 31 | 26 | 25 | 20 | 19 | 14 | 13 | 0 | |||||||
i | e | c | b | a | d | op96 | ||||||||||||||
54 | 4 | 6 | 6 | 6 | 6 | 14 |
The following list is in no particular order.
areasto
arenas? Jemalloc uses the
arenaterminology.
quad precisionis 128 bits, and quadword on SecureRISC would be 288 bits, which would be confusing. Or add a block transfer between SRs and VRs?
Width | Name | Mnemonic | Format | Extended Format | |||||
---|---|---|---|---|---|---|---|---|---|
S | E | P | S | E | P | Total | |||
128 | binary128 | Q | 1 | 15 | 113 | 1 | 19 | 141 | 160 |
64 | binary64 | D | 1 | 11 | 53 | 1 | 15 | 64 | 80 |
32 | binary32 | F | 1 | 8 | 24 | 1 | 11 | 29 | 40 |
16 | binary16 | H | 1 | 5 | 11 | 1 | 8 | 12 | 20 |
bfloat | B | 1 | 8 | 8 | |||||
8 | binary8p3 | P3 | 1 | 5 | 3 | 1 | 5 | 5 | 10 |
binary8p4 | P4 | 1 | 4 | 4 | |||||
binary8p5 | P5 | 1 | 3 | 5 |
for freeand having to potentially save/restore explicitly additional registers in function prologues and epilogues. Of course, it may end up saving more than required, which is potential performance hit. How might this work? The AR XRs register files would each architecturally have 32 registers before renaming with a 3‑bit WindowBase CSR to provide the base of the 16-entry window into these files. Thus, the architectural register for XR[a] would actually be XR[(WindowBase4..2 + 0∥a3..2)∥a1..0]. Calls would increment WindowBase by 4, in effect saving 4 of the caller’s registers in each register file (8 total): a0-a3 and x0-x3 and creating fresh a12-a15 and x12-x15 for the callee. Unlike Xtensa, SecureRISC should do overflow and underflow without using exceptions. This definition allows five call levels in 32 architectural registers of each register file. For example, when all 32 architectural registers are live, the processor begins saving four ARs and four XRs to the stack, reducing the number live to 28 in anticipation of the next call. Similarly, if the number live drops to 20, the processor begins loading registers in anticipation of the next return. The reason to target 32 architectural registers rather than 64 is to keep the post-renaming register files smaller, since these files need to hold all architectural registers plus registers for all registers being written by instructions scheduled in hardware. Thus, this is more of a code size feature than performance feature because there is relatively little hysteresis to significantly reduce saves and restores.
Tag | Use |
---|---|
0 | Nil/Null pointer |
1..31 | Sized pointers exact |
32..127 | Sized pointers inexact |
128..191 | Reserved (possible sized pointer extension) |
192..199 | Unsized pointers with ring |
200..207 | Code pointers with ring |
208..220 | Reserved |
221 | Pointer to blocks with header/trailer sizes |
222 | Cliqued pointer in AR |
223 | Segment Relative pointers |
224 | Lisp CONS |
225 | Lisp Function |
226 | Lisp Symbol |
227 | Lisp/Julia Structure |
228 | Lisp Array |
229 | Lisp Vector |
230 | Lisp String |
231 | Lisp Bit-vector |
232 | CHERI-128 capability word 0 |
233 | Reserved |
234 | Lisp Ratio, Julia Rational |
235 | Lisp/Julia Complex |
236 | Bigfloat |
237 | Bignum |
238 | 128‑bit integer |
239 | 128‑bit unsigned integer |
240 | 64‑bit integer |
241 | 64‑bit unsigned integer |
242 | Small integer types |
243 | Reserved |
244 | Double-precision floating-point |
245 | 8, 16, and 32‑bit floating-point |
246..249 | Reserved |
250 | Size header/trailer words |
251 | CHERI capability word 1. Bits 143..136 of AR doubleword store (used for save/restore and CHERI capabilities) |
252 | Basic Block Descriptor |
253 | Reserved for packed Basic Block Descriptors |
254 | Trap on load |
255 | Trap on load or store |
Earl Killian <webmaster at securerisc.org> | |||
2024-04-11 |