This is an older set of ideas that I later evolved into SecureRISC. For the moment I am calling this older ISA SecureRISC0. It will eventually go away; at the moment I am keeping it for my reference purposes only. If you got here via a search engine, I suggest you follow the link above to look at the current stuff instead.
This document is organized as successive expositions at increasing levels of detail, to give the reader an idea of the motivations and high-level differences from conventional processor architectures, eventually getting down to the detailed definitions that direct SecureRISC processor execution. So if the introductory material seems a little vague, that is because it attempts to sketch an overall context into which the details are later fit.
This document was created to develop old ideas and notes of mine. It is not a complete Instruction Set Architecture (ISA), but only the things I have had time to consider and work on. In addition, while learning about CHERI, a new option I had not previously considered occurred to me, and I created a variant called SecureRISC to develop those ideas in parallel, but that document is only barely started. For the time being, neither SecureRISC0 nor SecureRISC is anything but a point of discussion.
The ISA is mostly just ideas at this point. The opcode assignments and instruction specifications are no more than hints, and the virtual memory architecture needs work.
SecureRISC0 began as an exploration of what a security-conscious ISA might look like. Should it or SecureRISC someday turn into something more than an exploration, my intent would be to make it an Open Source ISA, along the lines of RISC‑V.
There is no software (compiler, operating system, etc.) for SecureRISC0. This is a paper-only spec at this point in time.
SecureRISC0 is my attempt to explore my thoughts on a security-conscious Instruction Set Architecture (ISA) appropriate for server class systems, but which with modern process technology (e.g. 5 nm) could even be used for IoT computing, given that the die area for a single such processor is a small fraction of one mm². I start with the assumption that the processor hardware should enforce bounds checking and that the virtual memory system should use older, more sophisticated security measures, such as those found in Multics, including rings, segmentation, and discretionary and non-discretionary access control. I also propose a new block structured instruction set that allows for better Control Flow Integrity (CFI) and performance.
I feel a comment about Multics is appropriate here. There seems to be an impression among many in the computer architecture world that many Multics features are complex. They are actually simple and general. Computer architecture from the 1980s to present is often an oversimplification of Multics. For example, segmentation in Multics served primarily to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) puts pressure to keep the number of bits devoted to access control minimal, when security considerations might suggest a more robust set. As another example, RISC‑V has two rings (User and Supervisor), with a single bit in PTEs (the U bit) serving as a ring bracket. Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. It is true that rings were not well utilized on Multics, but we now have more uses for multiple rings.
The goals for SecureRISC0 in order of priority are:
Non-goals for SecureRISC0 include (this list will probably grow):
Security can mean many things. One of the most important is preventing unassisted infiltration (e.g. through exploiting buffer overflows, use-after-free errors, and other programming mistakes). Another is preventing unintentionally-assisted infiltration (e.g. phishing attacks installing trojans), which may be accomplished through non-discretionary access control. SecureRISC0 is not a comprehensive attempt at security, but addresses the aspects that I think can be improved.
While I expect that non-discretionary access control is critical to computer security, at this point there is relatively little in SecureRISC0’s architecture that enforces this (it is primarily left to software). However, I am still looking for opportunities in this area.
Security, garbage collection, and dynamic typing may appear to be orthogonal, but I see them as synergistic. SecureRISC0 attempts to minimize the impact of programming mistakes in several ways, such as making bounds checking somewhat automatic and making compiler-generated checking more efficient. To address memory allocation error detection, however, other techniques are necessary. One possibility is garbage collection (GC), which eliminates these errors, but GC needs to be efficient for this application, hence the goal synergy. Another way to detect some allocation errors is tagging memory so that use after free is detected (unfortunately use-after-reallocation may not be detected with this mechanism). SecureRISC0 targets these goals with what will likely be its most controversial aspect: tags on words in memory and registers. The Basic Block descriptors may be more unusual, but I think the reader will come to appreciate them with familiarity (especially given the Control Flow Integrity advantages as a security feature); the reader may in the end not find memory tags convincing because of the non-standard word size that results, but I do not see an alternative. Tags simultaneously provide a mechanism for bounding pointers, support use-after-free detection and more efficient Garbage Collection (the best solution to allocation errors), and also happen to support dynamically typed languages.
SecureRISC0’s pointer bounding is, however, not as general as I would like; it is suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so the SecureRISC0 variants are better suited to languages such as Rust, Swift, Julia, Python, or Lisp. Running some C++ code would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. I have reserved a tag for C++ pointers, but using these would represent a less secure mode of operation. The supervisor would need to enable C++ pointers on a per-process basis; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity.
SecureRISC0 does have support for CHERI capabilities, which has been demonstrated to be fairly compatible with C++, at the cost of making pointers two words. There is much more about CHERI in subsequent sections.
The original motivation for block-structured ISAs was Instruction Level
Parallelism (ILP) studies that I did back in 1997 at SGI that showed
that instruction fetch was the limiting factor in ILP. This was before
modern branch prediction, e.g. TAGE, so that result may no longer be
true. The idea was that instruction fetch is like linked list
processing, with parsing at each list node to find the next link. I
wanted to replace linked lists with vectors, but couldn’t figure
out how, and settled for reducing the parsing at each list node. I
still feel that this is worthwhile, but the exact tradeoffs might
require updating older work in this area. The best validation of this
dates from 2007,
when Professor Christoforos Kozyrakis
convinced his PhD student
Dr. Ahmad Zmily
to look at this approach in a PhD thesis. In the introduction of
Block-Aware Instruction Set Architecture
Dr. Zmily wrote,
We demonstrate that the new architecture improves upon conventional
superscalar designs by 20% in performance and 16% in energy.
Such an advantage is not by itself enough to foist a new ISA upon the
world, but it encourages me to think that it provides impetus for
using such a base when creating a new ISA for other purposes, such as
security.
Prior to starting SecureRISC0, my previous experience was with many ISAs and operating systems. Long after starting my block-structured ISA thoughts, I became involved in the RISC‑V ISA project. RISC‑V is in many ways a cleaned-up version of the MIPS ISA (e.g. minus load and branch delay slots) and it seems likely to become the next major ISA after x86 and ARM. Being Open Source, RISC‑V has easily accessible documentation. As such I have used it for comparisons in the current description of SecureRISC0 and modified some of my virtual memory model to be slightly more RISC‑V compatible (e.g. choosing the number of segment bits to be compatible with RISC‑V Sv48). However, most aspects of the SecureRISC0 ISA predate my knowledge of RISC‑V and were not influenced by it, except that I found that RISC‑V’s Vector ISA was more developed than my own thoughts (which were most influenced by the Cray-1, which supported only 64‑bit precision).
In 2022 I encountered the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) research effort. I found their work impressive, but I had concerns about the practicality of some aspects. Despite my concerns, I thought that SecureRISC0 might be a good platform for CHERI, so I have extended SecureRISC0 to outline how it might support CHERI capabilities as an exploration. The SecureRISC variant incorporates a new sized pointer format based on ideas from CHERI. This sized pointer is not as capable as a CHERI pointer, but it is 64 bits rather than 128 bits, which has an obvious size advantage. There is a more detailed discussion of CHERI and SecureRISC0 below.
Some things remain unchanged from other RISCs. Memory is byte addressed. Like other RISC ISAs, SecureRISC0 is mostly based upon loads and stores for memory access. Integers and floating-point values have 8, 16, 32, or 64‑bit precision. Floating-point would be IEEE-754-2019 compatible. The Vector ISA will probably be similar to the RISC‑V Vector ISA, but might use the 48‑bit instruction format to do more in the instruction word and less with vset. Also, there are four explicit vector mask registers, rather than using v0.
Readers will have to decide for themselves whether the proposed virtual memory is conventional because it is somewhat similar to Multics, or unconventional, because it is different from RISC ISAs of the last forty years. A similar comment could be made concerning the register architecture, since it echos an ISA from 1976, but is somewhat different from RISCs since the 1980s.
Much more in SecureRISC0 is unconventional. To prepare the reader to put aside certain expectations, we list some of these things here at a high level, with details in later sections.
a[i] loads or stores to that location only after checking that i is within the bounds specified in the array pointer. C++ *p++-style programming is less amenable to SecureRISC0 bounds checking and is not the intended target of this ISA.
for i ← a to b (where the loop iteration count is b − a + 1) and for i ← a to b step -1 (where the loop iteration count is a − b + 1). The loop may be exited early with a conditional branch; only the loop back is predicted with the hint.
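The loop forms above can be modeled with a short sketch. This is my own illustrative Python model, not part of the SecureRISC0 definition: it assumes the count is computed before the loop (as MOVCOUNTA would) and decremented once per iteration (as LOOP would).

```python
def iteration_count(a, b, step):
    """Iteration count for the two loop forms stated in the text."""
    if step == 1:
        return b - a + 1        # for i ← a to b
    elif step == -1:
        return a - b + 1        # for i ← a to b step -1
    raise ValueError("only step ±1 modeled here")

def run_loop(a, b, step):
    """Model: COUNT set before entry, LOOP decrements toward zero."""
    count = iteration_count(a, b, step)  # MOVCOUNTA before the loop
    executed = 0
    while True:
        executed += 1                    # loop body runs once per iteration
        count -= 1                       # LOOP decrements COUNT
        if count == 0:
            break                        # loop back not taken: fall through
    return executed
```

Because the count is known before the loop begins, the loop-back direction is perfectly predictable from the shadow count, which is the point of the hint.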
The Basic Block (BB) descriptor aspect listed above is perhaps the most unfamiliar. To help motivate it for the reader, below are some of the rationale and advantages of this approach.
Contemporary processors have various structures that are created and updated during program execution to improve performance, such as Caches, TLBs, Branch Target predictors (BTB), Return Address predictors (RAS), Conditional Branch predictors, Indirect Jump and Call predictors, prefetchers, and so on. In SecureRISC0 one of these is moved into the ISA for performance and security. In particular the BTB becomes a Basic Block Descriptor Cache (BBDC). The BBDC caches lines of Basic Block Descriptors that are generated by the compiler, in a separate section from the instructions. I have also sought to make the Return Address predictor more cache-like, and to build in some additional ISA support for loop prediction.
fall through to subsequent descriptors, but each has a pointer to the instructions to fetch, and so the instruction blocks of a page could simply be sorted by frequency, placing the hottest first and the coldest last, or some similar arrangement, all without introducing new instructions or changing anything other than the pointers in the descriptors.
I started with the assumption that pointers are a single word, expanded based on the 8-bit tag to a base and size when loaded into the doubleword (144‑bit) Address Registers (ARs). Indexing using such a pointer checks the index value against the size. This supports programs oriented toward a[i] pointer usage, but not C++ *p++ pointer arithmetic.
In contrast, the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) Project started with the assumption that capability pointers are four words (including lower and upper bounds, the pointer itself, and permissions and object type), and invented a compression technique to get them down to two words. SecureRISC0 can support CHERI by using its 128‑bit AR load and store instructions to transfer capabilities to and from the 144‑bit ARs, and is therefore able to accommodate either singleword or doubleword pointers. Support for the CHERI bottom and top decoding, its permissions, and its additional instructions would be required. The CHERI tag bit is replaced with two SecureRISC0 reserved tag values (one tag value in word 0, another in word 1). I would expect languages such as Julia and Lisp to prefer singleword pointers, so supporting both singleword and doubleword pointers allows both to exist on the same processor depending on the instructions generated by the compiler.
Unlike CHERI, SecureRISC0 pointers encode only a size, not bottom and top values. As a result both SecureRISC0 and SecureRISC are more suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so the SecureRISC0 variants are better suited to languages other than C++, primarily ones that emphasize array indexing over pointer arithmetic. My expectation is that running some C++ code would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. I suspect it would be a better target for Rust, Swift, or Julia. I have reserved a tag for C++ pointers, but using these would represent a less secure mode of operation. The supervisor would need to enable C++ pointers on a per-process basis; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity.
Tagged memory words are separable from other aspects of SecureRISC0, such as the Multics aspects and the Basic Block descriptor aspects. One could imagine a version of SecureRISC0 without the tags and a 64‑bit word (72 bits with ECC in memory). Even in such a reduced ISA—call it SemiSecureRISC—I would keep the AR/XR/SR/VR model. SemiSecureRISC is still interesting for its performance and security advantages, but I do not plan to explore it at this time. There is also the possibility of combining SemiSecureRISC with CHERI and its 1‑bit tag, since the CHERI project has done a lot of important software work. Call such an ISA BlockCHERI. I suspect the CHERI researchers would say that the only advantage in BlockCHERI would be the performance advantage of the Block ISA and the AR/XR/SR separation, with the ARs specialized for CHERI capabilities, and the XRs/SRs for non-capability data. My primary thought on BlockCHERI is that, compared to its 65‑bit memory word (73 bits with ECC), the 72‑bit word (80 bits with ECC) provides 7 extra bits that may be put to good use.
One could imagine variants of SecureRISC0 that have only some of its features:
Name | Block ISA | Segmentation | Rings | Tags | CHERI | Word | Pointer
---|---|---|---|---|---|---|---
SecureRISC0 | ✔ | ✔ | ✔ | ✔ | ✔ | 72 | 72/144
SemiSecureRISC | ✔ | ✔ | ✔ | | | 64 | 64
BlockRISC | ✔ | | | | | 64 | 64
BlockCHERI | ✔ | ? | ? | | ✔ | 65 | 130
As I indicated earlier, I don’t think that BlockRISC is sufficient in itself to justify a new ISA. I am concentrating on the full package.
I need to think more carefully about I/O in a SecureRISC0 system. Certainly some I/O will be done in terms of byte streams transferred via DMA to and from main memory (e.g. DRAM). Such I/O, if directed to tagged main memory, writes the bytes with an integer tag. Similarly, if processors use uncached writes of 8, 16, 32, or 64 bits (as opposed to 8‑word blocks) to tagged memory, the memory tag must be changed to integer. Tag-aware I/O of 8‑word units exists and may be used for paging and so forth. It may be useful to provide a general facility for reading tagged memory, including the tags, as a stream of 8‑bit bytes with cryptographic signing, and for writing such a stream back with signature checking.
Ports onto the system interconnect fabric will have to have rights and permissions assigned by the hypervisor, and perhaps hypervisor guests. This needs to be worked out.
Being able to support user-mode I/O would be desirable, but it seems difficult to make this work, because the user ring code would be sending its own local virtual addresses to the I/O device for DMA. The I/O devices would then have to translate user addresses to system interconnect addresses via two-level page tables, and user mode would have to tell the I/O device which page table the supervisor assigned it, which it does not know. At the moment, I have left this unaddressed.
SecureRISC0 is my first set of thoughts on this subject. I have since been exploring a variation called SecureRISC. However, that document is barely different from this one at this point. The primary differences are:
SecureRISC0 and SecureRISC encode the size in different ways, and both have tradeoffs:
The SecureRISC pointer format under consideration would change to the following form:
71..67 | 66..64 | 63..61 | 60..48 | 47..0
---|---|---|---|---
SS (5) | ring (3) | SG (3) | segment (13) | size (21−SS) ∥ 0 (2) ∥ byte address (25+SS)
where the boundary between the byte address and the size, and the scaling of the size value, is determined by the SS field for values 0 to 20 (tags 0 to 167). Value 21 is Reserved. Value 22 is tentatively assigned to pointers to regions with header/trailer size words. Value 23 (tags 184 to 191) of SS is used for pointers with no size field (the only size check comes from the segment descriptor size field). Values 24 to 29 (tags 192 to 239) of SS and the ring field are used for dynamic type tagging with a pointer in bits 63..0 or a pointer with a size implied by the tag (e.g. 2 words for the CONS tag). Values 30 and 31 are used for dynamic type tagging with non-pointer data in bits 63..0. The byte address occupies bits SS+24..0, and the size occupies bits 47..SS+26 with value ptr[47..SS+26] ∥ 0^(SS×2+3). For example:
SS | Address bits | Address width | Size bits | Size width | Granularity (bytes)
---|---|---|---|---|---
0 | 24..0 | 25 | 47..26 | 22 | 8
1 | 25..0 | 26 | 47..27 | 21 | 32
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮
19 | 43..0 | 44 | 47..45 | 3 | 2^41
20 | 44..0 | 45 | 47..46 | 2 | 2^43
21 | Reserved | | | |
22 | Reserved | | | |
23 | 47..0 | 48 | n.a. | n.a. | n.a.
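The SS-dependent split can be made concrete with a small sketch. This is my own Python model of the formula stated above (byte address in bits SS+24..0, size field in bits 47..SS+26, scaled by 2^(SS×2+3)); the function name is mine, not part of the spec.

```python
def decode_address_and_size(ptr_low48, ss):
    """Split bits 47..0 of a sized pointer into (byte address, size in bytes)."""
    assert 0 <= ss <= 20, "SS values 21..31 have other meanings"
    addr = ptr_low48 & ((1 << (25 + ss)) - 1)   # byte address: bits SS+24..0
    size_field = ptr_low48 >> (ss + 26)         # size field: bits 47..SS+26
    size = size_field << (ss * 2 + 3)           # append SS*2+3 zero bits
    return addr, size
```

At SS = 0 the size granularity is 8 bytes and at SS = 1 it is 32 bytes, matching the table above; larger SS values trade size precision for address range.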
Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). While not a documentation convention, I might as well mention up front that SecureRISC0 is similarly Little Endian in its byte addressing.
domains where permissions were specified without nesting. This is straightforward, until the procedure for evaluating permissions of reference parameters using the privilege of the calling domain is attempted. SecureRISC does not attempt to generalize rings to domains due to this complexity.
What | R1,R2,R3 | seg RWX | R b | W b | X b | G b | Ring 0 | Ring 1 | Ring 2 | Ring 3 | Rings 4 to 6
---|---|---|---|---|---|---|---|---|---|---|---
User code | 7,1,1 | R-X | [1,6] | - | [1,6] | - | ---- | R-X- | R-X- | R-X- | R-X- |
User stack or heap | 1,1,7 | RW- | [1,6] | [1,6] | - | - | ---- | RW-- | RW-- | RW-- | RW-- |
User return stack | 3,1,7 | RW- | [1,6] | [3,6] | - | - | ---- | R--- | R--- | RW-- | RW-- |
User read-only file | 7,1,7 | R-- | [1,6] | - | - | - | ---- | R--- | R--- | R--- | R--- |
Supervisor driver code | 7,2,2 | R-X | [2,6] | - | [2,6] | - | ---- | ---- | R-X- | R-X- | R-X- |
Supervisor driver data | 2,2,7 | RW- | [2,6] | [2,6] | - | - | ---- | ---- | RW-- | RW-- | RW-- |
Supervisor code | 7,3,3 | R-X | [3,6] | - | [3,6] | - | ---- | ---- | ---- | R-X- | R-X- |
Supervisor heap or stack | 3,3,7 | RW- | [3,6] | [3,6] | - | - | ---- | ---- | ---- | RW-- | RW-- |
Compiler library | 7,0,0 | R-X | [0,6] | - | [0,6] | - | R-X- | R-X- | R-X- | R-X- | R-X- |
Supervisor gates for user | 7,3,1 | R-X | [3,6] | - | [3,6] | [1,2] | ---- | ---G | ---G | R-X- | R-X- |
Sandboxed JIT code | 1,0,0 | RWX | [0,6] | [1,6] | [0,1] | - | R-X- | RWX- | RW-- | RW-- | RW-- |
Sandboxed JIT stack or heap | 0,0,7 | RW- | [0,6] | [0,6] | - | - | RW-- | RW-- | RW-- | RW-- | RW-- |
Sandboxed JIT return stack | 1,0,7 | RW- | [0,6] | [1,6] | - | - | R--- | RW-- | RW-- | RW-- | RW-- |
User gates for sandbox | 7,1,0 | R-X | [1,6] | - | [1,6] | [0,0] | ---G | R-X- | R-X- | R-X- | R-X- |
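The per-ring permission columns above appear to follow a simple rule: an access is permitted when the current ring number falls within the corresponding bracket. The following Python sketch is my inference from the table rows, not a statement of the actual SecureRISC0 definition; the function name and bracket representation are mine.

```python
def ring_perms(ring, r=None, w=None, x=None, g=None):
    """Return an RWXG string for one ring, given brackets like (1, 6).

    A bracket of None means that kind of access is never permitted.
    """
    inside = lambda b: b is not None and b[0] <= ring <= b[1]
    return "".join(ch if inside(b) else "-"
                   for ch, b in (("R", r), ("W", w), ("X", x), ("G", g)))
```

For example, the "User code" row (R [1,6], X [1,6]) yields `----` in ring 0 and `R-X-` in rings 1 to 6, and the "Supervisor gates for user" row yields `---G` in rings 1 and 2 but `R-X-` in ring 3, matching the table.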
71..64 | 63..0
---|---
tag | data
8 | 64

71..64 | 63..0
---|---
240 | integer
8 | 64

71..64 | 63..0
---|---
244 | float64
8 | 64

71..64 | 63..61 | 60..0
---|---|---
0 | ring | 0
8 | 3 | 61

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
1..128 | ring | word address | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..0
---|---|---
129..135 | ring | byte address
8 | 3 | 61

71..64 | 63..61 | 60..0
---|---|---
137 | ring | byte address
8 | 3 | 61

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
136 | ring | word address | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
254 | 0 | word count | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
254 | 7 | − word count | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
192 | ring | BB descriptor word address | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..0
---|---|---
200 | ring | Local virtual address
8 | 3 | 61

71..64 | 63..0
---|---
252 | CHERI capability bits
8 | 64

71..64 | 63..0
---|---
255 | data
8 | 64
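The common structure of the word formats above (tag in bits 71..64, data in bits 63..0) can be sketched with a few helpers. This is a minimal illustrative model; the helper names are mine, and only the tag values 240 (integer) and 244 (float64) are taken from the formats above.

```python
TAG_INTEGER = 240   # from the integer word format above
TAG_FLOAT64 = 244   # from the float64 word format above

def pack_word(tag, data):
    """Assemble a 72-bit tagged word: tag in bits 71..64, data in 63..0."""
    assert 0 <= tag <= 255 and 0 <= data < (1 << 64)
    return (tag << 64) | data

def tag_of(word):
    """Extract the 8-bit tag from bits 71..64."""
    return word >> 64

def data_of(word):
    """Extract the 64-bit data from bits 63..0."""
    return word & ((1 << 64) - 1)
```

Hardware would of course carry the tag in dedicated storage rather than arithmetic bits, but the bit positions are as shown.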
As noted earlier, it is useful to provide tags for Common Lisp, Python, and Julia types, even when they are simply pointers to fixed-sized memory and could theoretically use tags 1..128. This would consume perhaps 10 more tags, as illustrated in the following table, with the assumption that other types could employ the structure type or something like it (perhaps some of the following could do so as well).
Tag | Lisp | Julia | Data use
---|---|---|---
1..128 | simple-vector? | Tuple? | Pointer to N words
129..135 | no dynamic typing use | |
136 | simple-vector? | Tuple? | Pointer to N words
137..223 | no dynamic typing use | |
224 | CONS | | Pointer to a pair
225 | Function | | Pointer to a pair
226 | Symbol | | Pointer to structure
227 | Structure | Structure? | Pointer to structure
228..229 | no dynamic typing use | |
230 | Array | | Pointer to structure
231 | Vector | | Pointer to structure
232 | String | | Pointer to structure
233 | Bit-vector | | Pointer to structure
234 | Ratio | Rational | Pointer to pair
235 | Complex | Complex | Pointer to pair
236 | Bigfloat | BigFloat | Pointer to structure
237 | Bignum | BigInt | Pointer to structure
238 | | Int128 | Pointer to pair, −2^127..2^127−1
239 | | UInt128 | Pointer to pair, 0..2^128−1
240 | Fixnum | Int64 | −2^63..2^63−1
241 | | UInt64 | 0..2^64−1
242 | Character | Bool, Char, Int8, Int16, Int32, UInt8, UInt16, UInt32 | UTF-32 + modifiers, subtype in upper 32 bits
243 | no dynamic typing use | |
244 | Float | Float64 | IEEE-754 binary64
245 | | Float16, Float32 | subtype in upper 32 bits
246..255 | no dynamic typing use | |
In addition to Lisp types, SecureRISC0 could define tags for other dynamically typed languages, such as Python. Tuples, ranges, and sets might be examples. Other types, such as modules, might use a general structure-like building block rather than individual tags, as suggested for Lisp above.
71..64 | 63..61 | 60..50 | 49..41 | 40..25 | 24..15 | 14..11 | 10 | 9..6 | 5..0
---|---|---|---|---|---|---|---|---|---
253 | hint | targr | targl | start | offset | size | c | next | prev
8 | 3 | 11 | 9 | 16 | 10 | 4 | 1 | 4 | 6
63..61 | 60..58 | 57..48 | 47..0
---|---|---|---
ring | SG | SEG | offset
3 | 3 | 10 | 48
63..50 | 49..0
---|---
region | offset
14 | 50
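Splitting a local virtual address per the ring/SG/SEG/offset layout above is simple bit extraction. A minimal Python sketch (the function name is mine):

```python
def split_local_va(va):
    """Split a 64-bit local virtual address into (ring, SG, SEG, offset).

    Layout per the table above: ring in bits 63..61, SG in 60..58,
    SEG in 57..48, offset in 47..0.
    """
    ring = (va >> 61) & 0x7
    sg = (va >> 58) & 0x7
    seg = (va >> 48) & 0x3FF           # 10-bit segment number
    offset = va & ((1 << 48) - 1)      # 48-bit offset within the segment
    return ring, sg, seg, offset
```

The 48-bit offset matches the choice of segment bits made for RISC‑V Sv48 compatibility mentioned earlier.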
The user process state includes:
Name | Depth | Width | Read ports | Write ports | Description
---|---|---|---|---|---
PC | 1 | 3 + 58 + 5 | | | The Program Counter holds the current ring number, Basic Block descriptor address, and 5‑bit offset into the basic block of the next instruction. The 5‑bit offset is only visible on exceptions.
CSP | 8 | 3 + 58 | | | The Call Stack Pointer holds the ring number and address of the return address stack maintained by call and return basic blocks. The Program Stack Pointer is held in an AR designated by the Software ABI. There is one CSP per ring.
COUNT | 1 | 64 | | | The Loop Count register is used for the innermost loop that iterates up to a number of times determined prior to entering the loop. It is set by the MOVCOUNTA instruction and is decremented to zero by the LOOP instruction.
CARRY | 1 | 64 | | | The Carry register is used as an implicit input and output on multiplication as follows: p ← SR[c] + (SR[a] ×u SR[b]) + CARRY; SR[d] ← p63..0; CARRY ← p127..64. It could also be used in the ADDC instruction as follows: s ← SR[a] +u SR[b] + CARRY0; SR[d] ← s63..0; CARRY ← 063 ∥ s64; but in this case it may be preferable to use BR sources and destinations instead.
VL | 1 | 64 | | | The Vector Length register specifies the length of vector loads, stores, and operations.
VSTART | 1 | 7 | | | The Vector Start register is used to restart vector operations after exceptions. Details to follow.
VM | 4 | 128 | | | The Vector Mask register file stores a bit mask for elements of vector operations. VM[0] is hardwired to all 1s and is used for unmasked operations.
AR | 16 | 133 | 2 | 1 | The Address Register file holds pointers and integers used in calculations related to control flow and to load and store address generation. No AR is hardwired to 0. Bits 63..0 are address or data (bits 63..61 are the ring number if an address), bits 71..64 are the tag, and bits 132..72 are the size expanded from the tag, potentially set by reading address−8. In some micro-architectures, operations on ARs are executed speculatively. (Non-AR operations may be queued until non-speculative, or may be speculatively executed as well.)
XR | 16 | 72 | 2 | 1 | The Index Register file holds integers used in calculations related to control flow and to load and store address generation. No XR is hardwired to 0. Bits 63..0 are data and bits 71..64 are the tag. The XRs primarily hold integer tagged data, but other tags may be loaded. In some micro-architectures, operations on XRs are executed speculatively. (Non-XR operations may be queued until non-speculative, or may be speculatively executed as well.) The XR register file requires two read ports and one write port per instruction.
SR | 16 | 72 | 3 | 1 | The Scalar Register file holds data for computations not involved in address generation, primarily integer or floating-point values. Tags are stored, and so SRs may be used for copying arbitrary data, including pointers, but no instruction uses SRs as an address (e.g. base) register. Integer operations check for integer tags, and floating-point operations check for float tags. No SR is hardwired to 0. In some micro-architectures, operations on SRs occur later in the pipeline than operations on ARs, separated by a queue, allowing these operations to wait for data cache misses while the AR engine continues to move ahead generating addresses. When multiple functional units operate in parallel, only some will support 3 source operands, with the others only two. The instructions with three SR source operands are multiply/add (both integer and floating-point) and funnel shifts.
BR | 16 | 1 | 3 | 1 | Boolean Registers hold boolean values, such as the result of comparisons and logical operations on other boolean values. BRs are typically used to hold SR register comparisons and may avoid branch prediction misses in some algorithms. BR[0] is hardwired to 0. Attempts to write 1 to BR[0] trap, which converts such instructions into negative assertions.
VR | 16 | 72 × 128 | 3 | 1 | Vector Registers hold vectors of tagged data, typically integers or floating-point data.
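The CARRY register semantics from the table can be checked with a short model. This sketch follows the two formulas given for the multiply-accumulate and ADDC cases; the function names and register-value parameters are mine.

```python
MASK64 = (1 << 64) - 1

def mul_accumulate(sr_a, sr_b, sr_c, carry):
    """Return (SR[d], new CARRY) for the widening multiply-accumulate.

    p ← SR[c] + (SR[a] ×u SR[b]) + CARRY; SR[d] ← p63..0; CARRY ← p127..64
    """
    p = sr_c + sr_a * sr_b + carry   # fits in 128 bits even at max operands
    return p & MASK64, p >> 64

def add_with_carry(sr_a, sr_b, carry):
    """ADDC model: s ← SR[a] +u SR[b] + CARRY0; CARRY gets the carry-out."""
    s = sr_a + sr_b + (carry & 1)
    return s & MASK64, s >> 64
```

Note that the worst case, (2^64−1)² + (2^64−1) + (2^64−1) = 2^128 − 1, still fits in 128 bits, so the full accumulated sum never overflows the p127..0 result.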
The SR register file must support 3 read ports and 1 write port per instruction, at least for floating-point multiply/add instructions. Since it does, other operations on SRs may take advantage of the third source operand.
The next field of the BB descriptor is used to specify how the successor to the current BB is determined. The values are given in the following table:
Value | Description |
---|---|
0 | Unconditional branch: The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below. |
1 | Conditional branch: The branch predictor is used to determine whether this branch is taken or not, and this prediction is checked by the branch decision given by a branch instruction in the instructions of the basic block. There should be exactly one branch, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8. |
2 | Call: The address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below. |
3 | Conditional Call: The branch predictor is used to determine whether this call is taken or not, and this prediction is checked by the branch decision given by a branch instruction in the instructions of the basic block. There should be exactly one branch, which may be located anywhere in the basic block instructions. In the case the call is not taken, the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, the address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. |
4 | Loop back: The shadow value of the COUNT register is used to determine whether this branch is taken or not, and this prediction is checked by the LOOP instruction in the instructions of the basic block. There should be exactly one LOOP, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8. |
5 | Conditional Loop back: The branch predictor is used to determine whether this loop back is taken or not, and this prediction is checked by the branch decision given by a branch instruction in the instructions of the basic block. If the loop back is enabled by the branch, the shadow value of the COUNT register is used to determine whether this loop is taken or not, and this prediction is checked by the LOOP instruction in the instructions of the basic block. There should be exactly one LOOP, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8. |
6 | Fall through: This Basic Block is unconditionally followed by the BB at PC + 8. |
7 | Reserved. |
8 | Jump Indirect: The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. |
9 | Conditional Jump Indirect: The branch predictor is used to determine whether this jump indirect is taken or not, and this prediction is checked by a branch instruction in the instructions of the basic block. If the jump indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. In the case where the jump is not taken, the destination is the fall-through BB descriptor at PC + 8. This type is expected to be used for case dispatch, where the conditional test checks whether the value is within range, and the JUMP uses PC ← PC + (XR[b] × 8) to choose one of several dispatch basic block descriptors, presuming that the BBs fit in the same 4 KiB region (if not, then a table and PC ← lvload72(AR[a] + XR[b]) should be used). |
10 | Call Indirect: The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. The address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. |
11 | Conditional Call Indirect: The branch predictor is used to determine whether this call indirect is taken or not, and this prediction is checked by a branch instruction in the instructions of the basic block. If the call indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. In the case where the call is not taken, the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. |
12 | Return: The Call Stack cache is used to predict the return using CSP[PC63..61] − 8 as the index and CSP[PC63..61] is decremented by 8. |
13 | Reserved. |
14 | Reserved. |
15 | Reserved. |
The prev field of the BB descriptor is used to specify what methods are allowed to get to this BB as a set of bits, with prev1..0 controlling interpretation of prev5..2:
prev1..0 = 0:
Bit | Description |
---|---|
2 | Fall through to this BB allowed |
3 | Branch/Loopback to this BB allowed |
4 | Jump to this BB allowed (for case dispatch) |
5 | Return to this BB allowed |
prev1..0 = 1:
Bit | Description |
---|---|
2 | Call allowed |
3 | Gate allowed |
4 | Reserved |
5 | Reserved |
prev1..0 = 2:
Bit | Description |
---|---|
2 | Reserved |
3 | Reserved |
4 | Reserved |
5 | Reserved |
prev1..0 = 3:
Bit | Description |
---|---|
2 | Exception entry |
3 | Reserved |
4 | Reserved |
5 | Reserved |
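The prev-field check can be sketched as follows. This is a hypothetical model: the mapping of prev1..0 values 0-3 to the four tables above, and the bit positions, are assumptions taken from the order in which the tables appear.

```python
# Hypothetical sketch of the prev-field CFI check. Bit positions and the
# prev1..0 table selection are assumptions from the tables above.

FALL_THROUGH, BRANCH, JUMP, RETURN = 2, 3, 4, 5   # bits when prev1..0 = 0
CALL, GATE = 2, 3                                  # bits when prev1..0 = 1
EXCEPTION = 2                                      # bit when prev1..0 = 3

def cfi_allowed(prev: int, arrival: str) -> bool:
    """True if arriving at this BB via `arrival` is permitted."""
    mode = prev & 0b11
    bit = {
        0: {"fallthrough": FALL_THROUGH, "branch": BRANCH,
            "jump": JUMP, "return": RETURN},
        1: {"call": CALL, "gate": GATE},
        3: {"exception": EXCEPTION},
    }.get(mode, {}).get(arrival)
    return bit is not None and (prev >> bit) & 1 == 1
```

Under these assumptions, a BB reachable only by call would carry prev = 0b000101 (mode 1, call bit set); any fall-through or return into it would then fault.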
Basic Block descriptors with one of the four call types (Call,
Conditional Call, Call Indirect, Conditional Call Indirect), push the
return address on a protected stack addressed by
the CSP indexed by the target ring
number (which is the same as the current ring number unless a gate is
addressed). Returns pop the address from the protected stack and jump
to it. The ring number of the CSP
pointer is used for the stores and loads, and typically this ring is not
writeable by the current ring.
The call semantics are as follows:
lvstore72(CSP[TargetPC63..61]) ← PC + 8
CSP[TargetPC63..61] ← CSP[TargetPC63..61] +p 8
The return semantics are as follows:
PC ← lvload72(CSP[PC63..61] −p 8)
CSP[PC63..61] ← CSP[PC63..61] −p 8
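A minimal sketch of these call/return shadow-stack semantics, assuming eight rings indexed by address bits 63..61 and modeling the protected stack memory as a dictionary; the CSP base values and class names are hypothetical.

```python
# Hypothetical model: one protected call stack per ring, with the ring
# chosen by bits 63..61 of the target (call) or current (return) PC.

NUM_RINGS = 8

class ShadowStacks:
    def __init__(self):
        self.mem = {}                                         # protected memory
        self.csp = [ring << 61 for ring in range(NUM_RINGS)]  # assumed stack bases

    def call(self, pc: int, target_pc: int) -> None:
        ring = (target_pc >> 61) & 0x7
        self.mem[self.csp[ring]] = pc + 8                     # push return address
        self.csp[ring] += 8                                   # CSP[ring] +p 8

    def ret(self, pc: int) -> int:
        ring = (pc >> 61) & 0x7
        self.csp[ring] -= 8                                   # CSP[ring] -p 8
        return self.mem[self.csp[ring]]                       # pop return address
```

Note that a gate call to a different ring pushes onto the target ring's stack, so the matching return (executing in that ring) pops from the same stack.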
Overflow detection is important for implementing bignums in languages such as Lisp. SecureRISC0 provides a reasonably complete set of such instructions in addition to the usual mod 264 add, subtract, negate, multiply, and shift left.
Unsigned overflow could be detected by using the ADDC and SUBC instructions with BR[0] as the carry-in and BR[0] as the carry-out. But it might also make sense to have ADDUO (Add Unsigned with Overflow).
In addition the ADDSO (Add Signed with Overflow), ADDUSO (Add Unsigned Signed with Overflow), SUBSO (Subtract Signed with Overflow), SUBSUO (Subtract Signed Unsigned with Overflow), SUBUSO (Subtract Unsigned Signed with Overflow), and NEGO (Negate with Overflow) instructions provide overflow checking for signed addition, subtraction, and negation, and for mixed signed/unsigned addition and subtraction. There is also SLLO (Shift Left Logical with Overflow) and SLAO (Shift Left Arithmetic with Overflow) in addition to the usual SLL. Finally there are MULUO, MULSO, and MULSUO for multiplication with overflow detection.
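The overflow conditions these instructions detect can be sketched for ADDSO and ADDUO; values are 64-bit two's-complement words in [0, 2^64), the trap is modeled as a Python exception, and the function names simply echo the mnemonics.

```python
# Sketch of overflow detection matching ADDSO/ADDUO semantics.

MASK64 = (1 << 64) - 1

def addso(a: int, b: int) -> int:
    # signed add: overflow iff the operands share a sign bit
    # that the result does not
    s = (a + b) & MASK64
    if ((~(a ^ b)) & (a ^ s)) >> 63 & 1:
        raise OverflowError("ADDSO overflow")
    return s

def adduo(a: int, b: int) -> int:
    # unsigned add: overflow iff there is a carry out of bit 63
    s = a + b
    if s > MASK64:
        raise OverflowError("ADDUO overflow")
    return s & MASK64
```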
Overflow in the unsigned addition of load/store effective address generation is trapped. Segment bounds are also checked during effective address generation: the segment size is determined from the base register, and the effective address must agree with the base register for bits 60..size (this is hard to implement and might need another cache).
SecureRISC0 has trap instructions and Boolean Registers (BRs) primarily as a way to avoid conditional branching for computation. For example, to compute the min of x1 and x3 into x6, the RISC‑V ISA would use conditional branches:
    move x6, x1
    blt  x1, x3, L
    move x6, x3
L:
The performance of the above on contemporary micro-architectures depends on the conditional branch prediction rate and the mispredict penalty, which in turn depends on how consistently x1 or x3 is the minimum value. In SecureRISC0, the sequence could be as follows:
    lt  b2, s1, s3
    sel s6, b2, s1, s3
This sequence involves no conditional branches, and has consistent performance.
As another example, the range test
assert ((lo <= x) && (x <= hi));
on RISC‑V would compile to
    blt x, lo, T
    bge hi, x, L
T:  jal assertionfailed
L:
but on SecureRISC0 would compile to
    lt   b1, x, lo
    orle b0, b1, hi, x
which involves no conditional branches, instead using a write to b0 as a negative assertion check (trap if the value to be written is 1). The assembler would also accept torle b1, hi, x as equivalent to the above orle by supplying the b0 destination operand.
Even when conditional branches are used, the boolean registers sometimes permit several tests to be combined before branching, so if we were branching on the range test above, instead of asserting it, the code might be
    lt    b1, x, lo
    borle b1, hi, x, outofrange
which has one branch rather than two.
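The b0 negative-assertion idiom can be modeled as follows. The register file and trap machinery are assumptions, and the high-bound test is written directly as x > hi rather than committing to a specific comparison mnemonic.

```python
# Hypothetical model of b0 as a negative-assertion sink: b0 reads as 0,
# and any instruction that writes 1 to it traps (modeled as an exception).

class BooleanRegs:
    def __init__(self):
        self.br = [0] * 16                 # b0..b15

    def write(self, d: int, value: int) -> None:
        if d == 0:
            if value:
                raise RuntimeError("negative assertion failed (b0 <- 1)")
            return                         # b0 stays hardwired to 0
        self.br[d] = value

def range_check(regs: BooleanRegs, x: int, lo: int, hi: int) -> None:
    # models: lt b1, x, lo ; then OR the high-bound failure into b0
    regs.write(1, int(x < lo))
    regs.write(0, regs.br[1] | int(x > hi))
```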
Operations on tagged values trap if the tags are unexpected values. Integer addition requires that both tags be integers, or that one tag be a pointer type and the other an integer. Integer subtraction requires the subtrahend tag to be an integer tag and the minuend to be either an integer or pointer tag. The resulting tag is integer when all sources are integers, or pointer when one operand is a pointer. Integer bitwise logical operations and shifts require integer tagged operands and produce an integer tagged result. Floating point addition, subtraction, multiplication, division, and square root require floating-point tagged operands. Performing integer operations on floating-point tagged values (e.g. to extract the exponent) requires a CAST instruction to first change the tag. Similarly, performing logical operations on a pointer requires a CAST instruction to integer type.
Comparisons of tagged values compare the entire word for =, ≠, <u, ≥u etc. This allows sorting regardless of type. Similarly the CMPU operation produces −1, 0, 1 based on <u, =, >u of word values.
One advantage of the 3 read SR file is
that shifts can be based upon a
funnel shift where the value to be shifted is the catenation
of SR[a]
and SR[b],
allowing for rotates by specifying the same operand for the high and low
funnel operands, and multiword shifts by supplying adjacent source words
of the multiword value. The basic operations are then
SR[d] ← (SR[b] ∥ SR[a]) >> imm6,
SR[d] ← (SR[b] ∥ SR[a]) >> (SR[c] mod 64), and
SR[d] ← (SR[b] ∥ SR[a]) >> (−SR[c] mod 64).
Conventional logical and arithmetic shifts are also provided. Left shifts supply 0 for the low side of the funnel and use a negative shift amount. Logical right shifts supply 0 on the high side of the funnel, and arithmetic right shifts supply a sign-extended version of SR[a] on the high side of the funnel.
Need to decide whether overflow detecting left shifts are required.
The CARRY register could be used as a funnel shift operand instead of an SR, but that seems less flexible.
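The funnel-shift primitive and the shifts derived from it can be sketched as follows; the function names are illustrative, not ISA mnemonics.

```python
# Sketch of the funnel-shift primitive: shift the 128-bit catenation
# (hi || lo) right and keep the low 64 bits.

MASK64 = (1 << 64) - 1

def funnel(hi: int, lo: int, amount: int) -> int:
    # low 64 bits of (hi || lo) shifted right by amount mod 64
    return (((hi << 64) | lo) >> (amount & 63)) & MASK64

def rotr(x: int, n: int) -> int:
    # rotate right: the same word supplies both funnel inputs
    return funnel(x, x, n)

def sll(x: int, n: int) -> int:
    # left shift: 0 on the low side, negative shift amount;
    # n = 0 must be special-cased, since (-0 mod 64) selects the
    # (zero) low word -- one reason conventional shifts are also provided
    return funnel(x, 0, -n) if n else x
```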
The Add with Carry instruction ADDC is defined to take a BR source as the carry-in and a BR destination as the carry-out, for multiword addition. The definition is then
s ← SR[a] +u SR[b] +u BR[c]
SR[d] ← s63..0
BR[e] ← s64.
An alternative requires fewer operands, but uses one bit in the
64‑bit CARRY register:
s ← SR[a] +u SR[b] +u CARRY0
SR[d] ← s63..0
CARRY ← 063 ∥ s64.
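Multiword addition with ADDC chaining can be sketched as follows:

```python
# Sketch of ADDC (BR carry form) chained across a little-endian
# multiword integer.

MASK64 = (1 << 64) - 1

def addc(a: int, b: int, carry_in: int) -> tuple[int, int]:
    s = a + b + carry_in
    return s & MASK64, s >> 64          # (sum word s63..0, carry-out s64)

def add_multiword(x: list[int], y: list[int]) -> list[int]:
    # x and y are little-endian lists of 64-bit words of equal length
    carry, out = 0, []
    for a, b in zip(x, y):
        word, carry = addc(a, b, carry)
        out.append(word)
    return out
```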
The ideal multiplication operation would be
SR[e],SR[d] ← (SR[a] ×u SR[b]) + SR[c] + SR[f]
to efficiently support multiword multiplication, but that requires 4
reads and 2 writes, which we clearly don’t want. The tentative
alternative is to introduce a
64‑bit CARRY register to provide
the additional 64‑bit input to the 128‑bit product and
a place to store the high 64 bits of the product. This
requires some careful thought for OoO micro-architectures and so
is a tentative proposal. It may be that even an OoO processor
will be called on to have a subset of instructions that are to be
executed in-order relative to each other, and the multiword
arithmetic instructions can be put in this queue.
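The CARRY-based multiply step can be sketched as one column of a schoolbook multiword multiply; the state class and function names here are hypothetical.

```python
# Hypothetical sketch of the CARRY-register multiply step: each call
# produces the low 64 bits of a*b + addend + CARRY and leaves the high
# 64 bits in CARRY. The sum always fits in 128 bits, since
# (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.

MASK64 = (1 << 64) - 1

class MulState:
    def __init__(self):
        self.carry = 0

def mulstep(state: MulState, a: int, b: int, addend: int) -> int:
    p = a * b + addend + state.carry
    state.carry = p >> 64              # high 64 bits back to CARRY
    return p & MASK64                  # low 64 bits of the product

def mul_multiword(x: list[int], y: int) -> list[int]:
    # multiply a little-endian multiword x by a single 64-bit word y
    st = MulState()
    out = [mulstep(st, xi, y, 0) for xi in x]
    out.append(st.carry)
    return out
```

The serial dependence through CARRY is exactly what makes this awkward for OoO renaming, as the text notes.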
It may be appropriate to add some instructions that exist only for code size reduction, which expand into multiple SecureRISC0 instructions early in the pipeline (e.g. before register renaming). The best candidates for this so far are doubleword load/store instructions, which would expand into two singleword load/store instructions. This expansion and execution as separate instructions in the backend of the pipeline avoids the issues with register renaming that would otherwise exist. Partial execution of the pair would be allowed (but loads to the source registers would not be allowed). Doubleword load/store instructions significantly reduce the size of function call entry and exit, and may be useful for loading a code pointer and context pointer pair for indirect calls.
The following outlines some of the instructions without giving them their full definitions, which includes tag and bounds checking. The full definitions will follow later.
The 16‑bit instruction formats are included for code density. Some evaluation of whether it is worth the cost should be considered. Note that the BB descriptor gives the sizes of all instructions in the basic block in the form of the start bit mask, and so the instruction size is not encoded in the opcodes. The start mask allows multiple instructions to be decoded in parallel without parsing the instruction stream.
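Decoding instruction boundaries from a start bit mask can be sketched as follows, assuming a 16-bit parcel granularity (consistent with the 16-bit formats), with bit i of the mask set when an instruction starts at parcel i:

```python
# Hypothetical sketch: recover instruction sizes (in bits) from the BB
# descriptor's start-bit mask without parsing any opcodes.

def instruction_sizes(start_mask: int, nparcels: int) -> list[int]:
    # consecutive start positions give each instruction's size
    starts = [i for i in range(nparcels) if (start_mask >> i) & 1]
    starts.append(nparcels)            # sentinel closing the last instruction
    return [(b - a) * 16 for a, b in zip(starts, starts[1:])]
```

With mask 0b1011 over four parcels this yields a 16-bit, a 32-bit, and a 16-bit instruction, each of which can be routed to a decoder in parallel.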
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
b | a | d | op1 | ||||||||
4 | 4 | 4 | 4 |
ADDA | d, a, b | AR[d] ← AR[a] +p XR[b] |
ADDX | d, a, b | XR[d] ← XR[a] + XR[b] |
ADD | d, a, b | SR[d] ← SR[a] + SR[b] |
LAD | d, a, b | AR[d] ← lvload144(AR[a] +p AR[b]×16) |
LA | d, a, b | AR[d] ← lvload72(AR[a] +p XR[b]×8) |
LX | d, a, b | XR[d] ← lvload72(AR[a] +p XR[b]×8) |
L | d, a, b | SR[d] ← lvload72(AR[a] +p XR[b]×8) |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
imm4 | a | d | op1 | ||||||||
4 | 4 | 4 | 4 |
ADDAI | d, a, imm4 | AR[d] ← AR[a] +p imm4 |
ADDXI | d, a, imm4 | XR[d] ← XR[a] + imm4 |
ADDI | d, a, imm4 | SR[d] ← SR[a] + imm4 |
LAI | d, a, imm4 | AR[d] ← lvload72(AR[a] +p imm4×8) |
LXI | d, a, imm4 | XR[d] ← lvload72(AR[a] +p imm4×8) |
LI | d, a, imm4 | SR[d] ← lvload72(AR[a] +p imm4×8) |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
op1da | a | d | op1 | ||||||||
4 | 4 | 4 | 4 |
RTAG | d, a | AR[d] ← 240 ∥ 056 ∥ AR[a]71..64 |
RSIZE | d, a | AR[d] ← 240 ∥ 03 ∥ AR[a]132..72 |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
b | a | op1ab | op1 | ||||||||
4 | 4 | 4 | 4 |
BEQA | a, b | branch if AR[a] = AR[b] |
BEQX | a, b | branch if XR[a] = XR[b] |
BNEA | a, b | branch if AR[a] ≠ AR[b] |
BNEX | a, b | branch if XR[a] ≠ XR[b] |
BLTAU | a, b | branch if AR[a] <u AR[b] |
BLTXU | a, b | branch if XR[a] <u XR[b] |
BGEAU | a, b | branch if AR[a] ≥u AR[b] |
BGEXU | a, b | branch if XR[a] ≥u XR[b] |
BLTX | a, b | branch if XR[a] <s XR[b] |
BGEX | a, b | branch if XR[a] ≥s XR[b] |
BNONEX | a, b | branch if (XR[a] & XR[b]) = 0 |
BANYX | a, b | branch if (XR[a] & XR[b]) ≠ 0 |
TEQA | a, b | trap if AR[a] = AR[b] |
TEQX | a, b | trap if XR[a] = XR[b] |
TNEA | a, b | trap if AR[a] ≠ AR[b] |
TNEX | a, b | trap if XR[a] ≠ XR[b] |
TLTAU | a, b | trap if AR[a] <u AR[b] |
TLTXU | a, b | trap if XR[a] <u XR[b] |
TGEAU | a, b | trap if AR[a] ≥u AR[b] |
TGEXU | a, b | trap if XR[a] ≥u XR[b] |
TLTX | a, b | trap if XR[a] <s XR[b] |
TGEX | a, b | trap if XR[a] ≥s XR[b] |
TNONEX | a, b | trap if (XR[a] & XR[b]) = 0 |
TANYX | a, b | trap if (XR[a] & XR[b]) ≠ 0 |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
op1a | a | op1ab | op1 | ||||||||
4 | 4 | 4 | 4 |
BEQNA | a | branch if AR[a]71..64 = 0 |
BNENA | a | branch if AR[a]71..64 ≠ 0 |
BEQZX | a | branch if XR[a] = 0 |
BNEZX | a | branch if XR[a] ≠ 0 |
BLTZX | a | branch if XR[a] <s 0 |
BGEZX | a | branch if XR[a] ≥s 0 |
BLEZX | a | branch if XR[a] ≤s 0 |
BGTZX | a | branch if XR[a] >s 0 |
BF | a | branch if BR[a] = 0 |
BT | a | branch if BR[a] ≠ 0 |
TEQZX | a | trap if XR[a] = 0 |
TNEZX | a | trap if XR[a] ≠ 0 |
TLTZX | a | trap if XR[a] <s 0 |
TGEZX | a | trap if XR[a] ≥s 0 |
TLEZX | a | trap if XR[a] ≤s 0 |
TGTZX | a | trap if XR[a] >s 0 |
TF | a | trap if BR[a] = 0 |
TT | a | trap if BR[a] ≠ 0 |
JMP | a | PC ← AR[a] |
LOOP | | COUNT ← COUNT − 1; branch if COUNT ≠ 0 |
15 | 8 | 7 | 4 | 3 | 0 | |||
imm8 | d | op1 | ||||||
8 | 4 | 4 |
XI | d, imm8 | XR[d] ← 240 ∥ imm8748 ∥ imm8 |
I | d, imm8 | SR[d] ← 240 ∥ imm8748 ∥ imm8 |
31 | 28 | 27 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | m | c | b | a | d | op2 | |||||||||||||||
4 | 6 | 2 | 4 | 4 | 4 | 4 | 4 |
ao1ao2 | d, c, a, b | SR[d] ← SR[c] ao1 (SR[a] ao2 SR[b]) |
Example ao1 might be: + (ADD) − (SUB) | ||
Example ao2 might be: + (ADD) − (SUB) × (MUL) | ||
FUN | d, b, a, c | t ← (SR[b]63..0∥SR[a]63..0) >> SR[c]5..0 SR[d] ← 240 ∥ t63..0 |
FUNN | d, b, a, c | t ← (SR[b]63..0∥SR[a]63..0) >> (−SR[c])5..0 SR[d] ← 240 ∥ t63..0 |
fo1fo2.D | d, c, a, b | SR[d] ← SR[c] fo1 (SR[a] fo2 SR[b]) |
Vao1ao2 | d, c, a, b, m | VR[d] ← VR[c] ao1 (VR[a] ao2 VR[b]) masked by VM[m] |
Vao1ao2 | d, c, a, b, m | VR[d] ← VR[c] ao1 (VR[a] ao2 SR[b]) masked by VM[m] |
Vfo1fo2.D | d, c, a, b, m | VR[d] ← VR[c] fo1 (VR[a] fo2 VR[b]) masked by VM[m] |
Vfo1fo2.D | d, c, a, b, m | VR[d] ← VR[c] fo1 (VR[a] fo2 SR[b]) masked by VM[m] |
Example fo1 might be: +f (ADD) −f (SUB) | ||
Example fo2 might be: +f (ADD) −f (SUB) ×f (MUL) | ||
bo1bo2 | d, c, a, b | SR[d] ← SR[c] bo1 (SR[a] bo2 SR[b]) |
Example bo1 might be: & (AND) | (OR) ^ (XOR) | ||
Example bo2 might be: & (AND) | (OR) &~ (ANDC) |~ (ORC) ^ (XOR) ^~ (XORC) << (SLL) >>u (SRL) >>s (SRA) | ||
SEL | d, c, a, b | SR[d] ← BR[c] ? SR[a] : SR[b] |
lo1lo2 | d, c, a, b | BR[d] ← BR[c] lo1 (BR[a] lo2 BR[b]) |
lo1copA | d, c, a, b | BR[d] ← BR[c] lo1 (AR[a] cop AR[b]) |
lo1copX | d, c, a, b | BR[d] ← BR[c] lo1 (XR[a] cop XR[b]) |
lo1cop | d, c, a, b | BR[d] ← BR[c] lo1 (SR[a] cop SR[b]) |
Vlo1cop | d, c, a, b | VM[d] ← VM[c] lo1 (VR[a] cop VR[b]) |
Vlo1cop | d, c, a, b | VM[d] ← VM[c] lo1 (VR[a] cop SR[b]) |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | i | c | i | a | d | op2 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
ao1ao2I | d, c, a, imm | SR[d] ← SR[c] ao1 (SR[a] ao2 imm12) |
bo1bo2I | d, c, a, imm | SR[d] ← SR[c] bo1 (SR[a] bo2 imm12) |
SELI | d, c, a, imm12 | SR[d] ← BR[c] ? SR[a] : imm12 |
lo1copI | d, c, a, imm | BR[d] ← BR[c] lo1 (AR[a] cop imm12) |
lo1copI | d, c, a, imm | BR[d] ← BR[c] lo1 (SR[a] cop imm12) |
Vlo1copI | d, c, a, imm | VM[d] ← VM[c] lo1 (VR[a] cop imm12) |
BR[0] is hardwired to 0. Using BR[0] as a destination acts as negative assertion, taking an exception if the value computed is 1. | ||
Example lo1/lo2: & (AND) | (OR) ^ (XOR) &~ (ANDC) |~ (ORC) ^~ (XORC) | ||
Example cop: = (EQ) ≠ (NE) <u (LTU) <s (LT) ≥u (GEU) ≥s (GE) tag= tag≠ tag< tag≥ word= word≠ word< word≥ |
31 | 28 | 27 | 22 | 21 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | op23 | b | a | d | op2 | ||||||||||||||
4 | 6 | 6 | 4 | 4 | 4 | 4 |
op0X | d, a, b | XR[d] ← XR[a] op0 XR[b] |
Example op0 might be: + (ADD) − (SUB) << (SLL) >>u (SRL) >>s (SRA) | ||
Possible op0 might include: minu mins, maxu maxs | ||
ao2 | d, a, b | SR[d] ← SR[a] ao2 SR[b] |
LX32U | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) XR[d] ← 240 ∥ 032 ∥ t |
L32U | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) SR[d] ← 240 ∥ 032 ∥ t |
LX32S | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) XR[d] ← 240 ∥ t3132 ∥ t |
L32S | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) SR[d] ← 240 ∥ t3132 ∥ t |
LX16U | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) XR[d] ← 240 ∥ 048 ∥ t |
L16U | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) SR[d] ← 240 ∥ 048 ∥ t |
LX16S | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) XR[d] ← 240 ∥ t1548 ∥ t |
L16S | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) SR[d] ← 240 ∥ t1548 ∥ t |
LX8U | d, a, b | t ← lvload8(AR[a] +p XR[b]) XR[d] ← 240 ∥ 056 ∥ t |
L8U | d, a, b | t ← lvload8(AR[a] +p XR[b]) SR[d] ← 240 ∥ 056 ∥ t |
LX8S | d, a, b | t ← lvload8(AR[a] +p XR[b]) XR[d] ← 240 ∥ t756 ∥ t |
L8S | d, a, b | t ← lvload8(AR[a] +p XR[b]) SR[d] ← 240 ∥ t756 ∥ t |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | i | op24 | i | a | d | op2 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
op0AI | d, a, imm | AR[d] ← AR[a] op0 imm12 |
op0XI | d, a, imm | XR[d] ← XR[a] op0 imm12 |
ao2I | d, a, imm | SR[d] ← SR[a] ao2 imm12 |
LADI | d, a, imm | AR[d] ← lvload144(AR[a] +p imm12×16) |
LAI | d, a, imm | AR[d] ← lvload72(AR[a] +p imm12×8) |
LXI | d, a, imm | XR[d] ← lvload72(AR[a] +p imm12×8) |
LI | d, a, imm | SR[d] ← lvload72(AR[a] +p imm12×8) |
LX32UI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) XR[d] ← 240 ∥ 032 ∥ t |
L32UI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) SR[d] ← 240 ∥ 032 ∥ t |
LX32SI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) XR[d] ← 240 ∥ t3132 ∥ t |
L32SI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) SR[d] ← 240 ∥ t3132 ∥ t |
LX16UI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) XR[d] ← 240 ∥ 048 ∥ t |
L16UI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) SR[d] ← 240 ∥ 048 ∥ t |
LX16SI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) XR[d] ← 240 ∥ t1548 ∥ t |
L16SI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) SR[d] ← 240 ∥ t1548 ∥ t |
LX8UI | d, a, imm | t ← lvload8(AR[a] +p imm12) XR[d] ← 240 ∥ 056 ∥ t |
L8UI | d, a, imm | t ← lvload8(AR[a] +p imm12) SR[d] ← 240 ∥ 056 ∥ t |
LX8SI | d, a, imm | t ← lvload8(AR[a] +p imm12) XR[d] ← 240 ∥ t756 ∥ t |
L8SI | d, a, imm | t ← lvload8(AR[a] +p imm12) SR[d] ← 240 ∥ t756 ∥ t |
MOVACOUNT | d | AR[d] ← 240 ∥ COUNT |
MOVCOUNTA | a | COUNT ← AR[a]63..0 |
MOVAS | d, a | AR[d] ← SR[a] |
MOVSA | d, a | SR[d] ← AR[a] |
MOVAB | d, a | AR[d] ← 240 ∥ 063 ∥ BR[a] |
MOVBA | d, a, imm6 | BR[d] ← AR[a]imm6 |
MOVSB | d, a | SR[d] ← 240 ∥ 063 ∥ BR[a] |
MOVBS | d, a, imm6 | BR[d] ← SR[a]imm6 |
MOVSBALL | d | SR[d] ← 240 ∥ 048 ∥ BR[15]∥BR[14]∥…∥BR[1]∥0 |
MOVSVM | d, m, w | SR[d] ← 240 ∥ VM[m]w×64+63..w×64 |
MOVVMS | d, a, w | VM[d]w×64+63..w×64 ← SR[a] |
31 | 28 | 27 | 22 | 21 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | imm6 | b | a | d | op2 | ||||||||||||||
4 | 6 | 6 | 4 | 4 | 4 | 4 |
FUNI | d, b, a, imm6 | t ← (SR[b]63..0∥SR[a]63..0) >> imm6 SR[d] ← 240 ∥ t63..0 |
31 | 28 | 27 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | m | c | b | a | op22 | op2 | |||||||||||||||
4 | 6 | 2 | 4 | 4 | 4 | 4 | 4 |
SA | c, a, b | lvstore72(AR[a] +p XR[b]×8) ← AR[c] |
SX | c, a, b | lvstore72(AR[a] +p XR[b]×8) ← XR[c] |
S | c, a, b | lvstore72(AR[a] +p XR[b]×8) ← SR[c] |
SX32 | c, a, b | lvstore32(AR[a] +p XR[b]×4) ← XR[c]31..0 |
S32 | c, a, b | lvstore32(AR[a] +p XR[b]×4) ← SR[c]31..0 |
SX16 | c, a, b | lvstore16(AR[a] +p XR[b]×2) ← XR[c]15..0 |
S16 | c, a, b | lvstore16(AR[a] +p XR[b]×2) ← SR[c]15..0 |
SX8 | c, a, b | lvstore8(AR[a] +p XR[b]) ← XR[c]7..0 |
S8 | c, a, b | lvstore8(AR[a] +p XR[b]) ← SR[c]7..0 |
Blo2 | a, b | branch if BR[a] lo2 BR[b] |
(equivalent to BORlo2 b0, a, b) | ||
BEQA | a, b | branch if AR[a] = AR[b] |
(equivalent to BOREQA b0, a, b) | ||
BEQX | a, b | branch if XR[a] = XR[b] |
BNEA | a, b | branch if AR[a] ≠ AR[b] |
BNEX | a, b | branch if XR[a] ≠ XR[b] |
BLTAU | a, b | branch if AR[a] <u AR[b] |
BLTXU | a, b | branch if XR[a] <u XR[b] |
BGEAU | a, b | branch if AR[a] ≥u AR[b] |
BGEXU | a, b | branch if XR[a] ≥u XR[b] |
BLTX | a, b | branch if XR[a] <s XR[b] |
BGEX | a, b | branch if XR[a] ≥s XR[b] |
BNONEX | a, b | branch if (XR[a] & XR[b]) = 0 |
BANYX | a, b | branch if (XR[a] & XR[b]) ≠ 0 |
Blo1lo2 | c, a, b | branch if BR[c] lo1 (BR[a] lo2 BR[b]) |
Blo1EQA | c, a, b | branch if BR[c] lo1 (AR[a] = AR[b]) |
Blo1EQX | c, a, b | branch if BR[c] lo1 (XR[a] = XR[b]) |
Blo1NEA | c, a, b | branch if BR[c] lo1 (AR[a] ≠ AR[b]) |
Blo1NEX | c, a, b | branch if BR[c] lo1 (XR[a] ≠ XR[b]) |
Blo1LTAU | c, a, b | branch if BR[c] lo1 (AR[a] <u AR[b]) |
Blo1LTXU | c, a, b | branch if BR[c] lo1 (XR[a] <u XR[b]) |
Blo1GEAU | c, a, b | branch if BR[c] lo1 (AR[a] ≥u AR[b]) |
Blo1GEXU | c, a, b | branch if BR[c] lo1 (XR[a] ≥u XR[b]) |
Blo1LTX | c, a, b | branch if BR[c] lo1 (XR[a] <s XR[b]) |
Blo1GEX | c, a, b | branch if BR[c] lo1 (XR[a] ≥s XR[b]) |
Blo1NONEX | c, a, b | branch if BR[c] lo1 ((XR[a] & XR[b]) = 0) |
Blo1ANYX | c, a, b | branch if BR[c] lo1 ((XR[a] & XR[b]) ≠ 0) |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | i | c | i | a | op22 | op2 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
SAI | c, a, imm | lvstore72(AR[a] +p imm12×8) ← AR[c] |
SI | c, a, imm | lvstore72(AR[a] +p imm12×8) ← SR[c] |
SA32I | c, a, imm | lvstore32(AR[a] +p imm12×4) ← AR[c]31..0 |
S32I | c, a, imm | lvstore32(AR[a] +p imm12×4) ← SR[c]31..0 |
SA16I | c, a, imm | lvstore16(AR[a] +p imm12×2) ← AR[c]15..0 |
S16I | c, a, imm | lvstore16(AR[a] +p imm12×2) ← SR[c]15..0 |
SA8I | c, a, imm | lvstore8(AR[a] +p imm12) ← AR[c]7..0 |
S8I | c, a, imm | lvstore8(AR[a] +p imm12) ← SR[c]7..0 |
BEQXI | a, imm12 | branch if XR[a] = imm12 |
(equivalent to BOREQXI b0, a, imm12) | ||
BNEXI | a, imm12 | branch if XR[a] ≠ imm12 |
BLTXUI | a, imm12 | branch if XR[a] <u imm12 |
BGEXUI | a, imm12 | branch if XR[a] ≥u imm12 |
BLTXI | a, imm12 | branch if XR[a] <s imm12 |
BGEXI | a, imm12 | branch if XR[a] ≥s imm12 |
BNONEXI | a, imm12 | branch if (XR[a] & imm12) = 0 |
BANYXI | a, imm12 | branch if (XR[a] & imm12) ≠ 0 |
Blo1EQXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] = imm12) |
Blo1NEXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] ≠ imm12) |
Blo1LTXUI | c, a, imm12 | branch if BR[c] lo1 (XR[a] <u imm12) |
Blo1GEXUI | c, a, imm12 | branch if BR[c] lo1 (XR[a] ≥u imm12) |
Blo1LTXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] <s imm12) |
Blo1GEXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] ≥s imm12) |
Blo1NONEXI | c, a, imm12 | branch if BR[c] lo1 ((XR[a] & imm12) = 0) |
Blo1ANYXI | c, a, imm12 | branch if BR[c] lo1 ((XR[a] & imm12) ≠ 0) |
SWITCHR | b | PC ← PC +p (XR[b]×8) |
Used when all cases are in the current page. | |
SWITCHI | a, imm12 | PC ← AR[a] +p (imm12×8) |
SWITCH | a, b | PC ← AR[a] +p (XR[b]×8) |
LJMPI | a, imm12 | PC ← lvload72(AR[a] +p imm12×8) |
LJMP | a, b | PC ← lvload72(AR[a] +p XR[b]×8) |
31 | 28 | 27 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||
op20 | imm16 | a | d | op2 | ||||||||||
4 | 16 | 4 | 4 | 4 |
ALLOCI | d, a, imm7 | AR[d] ← 051∥imm7∥03 ∥ imm7 ∥ min(AR[a]63..61, PC63..61) ∥ (AR[a]60..0 + AR[a]132..72) |
ALLOCI | d, a, imm16 |
lvstore72(AR[a] + 8) ← 254∥048∥imm16 AR[d] ← −(048∥imm16) ∥ 254 ∥ min(AR[a]63..61, PC63..61) ∥ (AR[a]60..0+AR[a]132..72+16) |
Primarily used for allocating stack frames with a15: | ||
ALLOCI | sp, sp, imm |
31 | 8 | 7 | 4 | 3 | 0 | |||
imm24 | d | op2 | ||||||
24 | 4 | 4 |
XI | d, imm | XR[d] ← 240 ∥ imm242340∥imm24 |
I | d, imm | SR[d] ← 240 ∥ imm242340∥imm24 |
I expect SecureRISC0 software to use the ILP64 model, where integers and pointers are both 64 bits. Even in the 1980s when MIPS was defining its 64‑bit ISA, I argued that integers should be 64 bits, but keeping integers 32 bits for C was considered sacred by others. The result is that an integer cannot index a large array, which is terrible. With ILP64, I don’t expect SecureRISC0 to need special 32‑bit add instructions (that sign-extend from bit 31 to bits 63..32).
Translation of local virtual addresses to system interconnect addresses is typically performed in a single processor cycle in one of several L1 TLBs, which may be supplemented with one or more L2 TLBs. If the TLBs fail to translate the address, then the processor performs a more lengthy procedure, and if that succeeds, the result is written into the TLBs to speed later translations. This TLB miss procedure determines the memory architecture. As described above, SecureRISC0 uses both segmentation and paging in its memory architecture. The first step of a TLB miss is therefore to determine a segment descriptor and then proceed as that directs. One way of thinking about SecureRISC0 segmentation is that it is a specialized first-level page table that controls the subsequent levels, including giving the page size and table depth (derived from the segment size).
SecureRISC0 segments may be directly mapped to an aligned system virtual address range equal to the segment size, or they may be paged. Direct mapping may be appropriate for I/O regions, for example. It consists of simply changing the high bits (above the segment size) of the local virtual address to the appropriate system virtual address bits, and leaving the low bits (below the segment size) unchanged.
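Direct mapping can be sketched in a few lines, assuming the segment size is a power of two and the system virtual base is aligned to it; segsize_log2 is the log2 of the segment size in bytes.

```python
# Minimal sketch of direct-mapped segment translation: replace the bits
# above the segment size with the system virtual base, keep the low bits.

def direct_map(lvaddr: int, sv_base: int, segsize_log2: int) -> int:
    low_mask = (1 << segsize_log2) - 1
    return (sv_base & ~low_mask) | (lvaddr & low_mask)
```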
Paging in SecureRISC0 takes advantage of segment sizes to be more efficient than in some ISAs and also supports multiple page sizes. The proposal for SecureRISC0 here is that each segment has a page size that is used for all subsequent levels, but this could be generalized to allow a programmable size at each level at the cost of complexity in hardware. My current thoughts on those page sizes are 4 KiB, 16 KiB, and 1 MiB, but which sizes would eventually be supported is a matter for evaluation. Only the last level page size affects the TLB in many micro-architectures, so page size at earlier levels is primarily a question of complexity in the hardware table walk. Supporting multiple page sizes in TLBs is costly, and should be done in a limited way, and I am concerned even about proposing three sizes.
Aside: 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size back in the 1960s as the trade-off between the memory wasted by allocating in page units and the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in 2020s from the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty is then a factor in the page size consideration that did not exist in the 1960s.
Aside: RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads. Multiple sizes do complicate TLB matching.
TLBs introduce one other complication. Typically when the supervisor switches from one process to another, it changes the segment and page tables. Absent an optimization, it would be necessary to flush the TLBs on any change in the tables, which is both costly in the cycles to flush and the misses that follow reloading the TLBs on memory references following the switch. Most processors with TLBs introduce a mechanism to reduce how often the TLB must be flushed, such as the Address Space Identifier (ASID) found in the MIPS translation hardware. The ASID is stored in the TLB, and when the supervisor switches to a new process, it either uses the process’ previous ASID, or assigns a new one if the TLB has been flushed since the last time the process ran. This allows its previous TLB entries to be used if they are still present in the TLB, but also avoids the TLB flush. When the ASIDs are used up, the TLB is flushed, and then ASID assignment starts fresh as processes are run. For example, a 5‑bit ASID would then require a TLB flush only when the 33rd distinct process is run after the last flush. The supervisor often uses translation and paging for its own data structures, some of which are process-specific, and some of which are common. To not require multiple TLB entries for the supervisor pages common between processes, a Global bit was introduced in the MIPS and other TLBs. This bit caused the TLB entry to ignore the ASID during the match process; such entries match any ASID. This whole issue occurs a second time when hypervisors switch between multiple guest operating systems, each of which thinks it controls the ASIDs in the TLB. RISC‑V for example introduced a VMID controlled by the hypervisor that works analogously to the ASID.
SecureRISC0 needs an ASID mechanism, and a way to bypass it for shared entries, for the same reasons as other ISAs. The question is whether this mechanism needs to be generalized, just as rings are a generalization of supervisor and user mode. I propose one such possible generalization with eight possible sharing opportunities, but whether this is required may be reevaluated. Perhaps SecureRISC0 will revert to a simple Global bit or just ASID=0 to mean Global. There is no particular reason to choose eight. Below is the mechanism proposed. Again, we expect that various service levels in the system will have some segments common to all of the service levels that they support, and that these should require only a single TLB entry, but that other segments might change their translation for each supported service level.
The simplest implementation for a Segment Descriptor Table (SDT) is to have a single Segment Descriptor Table Pointer (SDTP) register and use a Global bit in Page Table Entries (PTEs). My alternative ASID generalization is to group segments into eight groups (SG), and give each group its own SDT, as addressed by eight SDTP registers. These eight registers are then the zero level table, followed by the chosen Segment Descriptor Table (the first level), followed by zero to four levels of page table. Since the registers are not in memory, there are one to five levels of memory tables to walk starting with the Segment Descriptor Table. The segment size in the SDT allows the length of the walk to be per-segment, so most code segments (e.g. shared libraries) will have only one level of page table, but a process with a segment for weather data might require two or three levels (and might use a large page size as well to minimize TLB misses). Some hypervisor segments might be direct-mapped, and require only the SDT level of mapping. In addition if the hypervisor is not paging the supervisor, it might direct map many supervisor segments.
Here are the details. After a TLB miss, the processor starts by using the 3 high bits of the segment field of the address to pick one of eight Segment Descriptor Table Pointer registers (sdtp[0] to sdtp[7]). The low 10 bits of the segment field are then an index into the table at the system virtual address in the selected register. The size field of the sdtp register is used to bounds check the low 10 bits of the segment number before indexing, which allows each portion of the Segment Descriptor Table to be 0, 256, 512, 768, or 1024 entries (4 KiB to 16 KiB in multiples of 4 KiB). A size field of 0 disables the segment group; otherwise, the check is that lvaddr57..56 < sdtp[lvaddr60..58]11..9. If the bound check succeeds, the doubleword Segment Descriptor Entry is read from (sdtp[lvaddr60..58]63..12 ∥ 012) | (lvaddr57..48 ∥ 04), and this descriptor is used to bounds check the segment offset and to generate a system virtual address. When TLB entries are created to speed future translations, they use the Address Space Identifier (ASID) specified in bits 8..0 of the selected sdtp.
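This first step of the walk can be sketched as follows (a minimal sketch in Python, used purely for illustration; the dict representation of the sdtp registers is an assumption, not part of the spec):

```python
def sde_address(lvaddr, sdtp):
    """Locate the Segment Descriptor Entry after a TLB miss.

    sdtp models the eight SDTP registers as dicts with 'base' (the
    svaddress, low 12 bits zero) and 'size' (0 disables the group;
    otherwise the table holds size*256 entries).
    """
    sg = (lvaddr >> 58) & 0x7          # 3 high bits pick sdtp[0..7]
    seg = (lvaddr >> 48) & 0x3FF       # low 10 bits index within the group
    reg = sdtp[sg]
    # Bound check: the high 2 bits of the index against the 3-bit size.
    if reg["size"] == 0 or (seg >> 8) >= reg["size"]:
        raise MemoryError("segment bound check failed")
    # Each SDE is a doubleword (16 bytes), hence the << 4.
    return (reg["base"] & ~0xFFF) | (seg << 4)

# Example: segment group 1, segment index 5, a 256-entry table at 0x1000.
sdtp = [{"base": 0, "size": 0}] * 8
sdtp[1] = {"base": 0x1000, "size": 1}
addr = sde_address((1 << 58) | (5 << 48), sdtp)
# addr == 0x1000 | (5 << 4) == 0x1050
```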
This method can provide the functionality of the two levels found in other architectures (i.e. supervisor common using Global=1 and per-process using Global=0). A SecureRISC0 supervisor might simply use 256–1024 segments for the supervisor common mappings (with ASID=0) and 256–1024 segments for per-process mappings, with ASIDs dynamically assigned as processes are run. Such a system might set sdtp[7] at initialization, change sdtp[0] on process switch, and leave the other six groups unused (size=0).
Segment Descriptor Table Pointer registers are only readable and writable by ring 6. Other rings must use ring 6 calls to read and write these registers.
| 71..64 | 63..12 | 11..9 | 8..0 |
|---|---|---|---|
| 240 | svaddress63..12 | size | ASID |
| 8 | 52 | 3 | 9 |
| 71..64 | 63..33 | 32..31 | 30..29 | 28..24 | 23..21 | 20..18 | 17..15 | 14..12 | 11 | 10 | 9 | 8 | 7 | 6 | 5..0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 240 | 0 | G1 | G0 | PS | 0 | R3 | R2 | R1 | 0 | C | P | X | W | R | size |
| 8 | 31 | 2 | 2 | 5 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 6 |
Field(s) | Width | Description
---|---|---
size | 6 | Segment size is 2^size bytes. Value 0 indicates an invalid segment (or should there be a V bit?). Values 1..11 and 62..63 are reserved.
R | 1 | Read permission
W | 1 | Write permission
X | 1 | Execute permission
P | 1 | Pointer permission (pointers with segment numbers are permitted)
C | 1 | CHERI Capability permission
R1, R2, R3 | 3 | Ring brackets as described elsewhere.
PS | 5 | Page size is 2^(PS+12) bytes; the value 31 indicates a direct-mapped segment (no paging).
G0 | 2 | Generation number of this segment for GC.
G1 | 2 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this segment traps, and software lowers the G1 field. This feature is turned off by setting G1 to 3.
| 71..64 | 63..12 | 11..0 |
|---|---|---|
| 240 | svaddress63..12 | 0 |
| 8 | 52 | 12 |
The interpretation of the system virtual address in SDE word 1 depends on the PS field in SDE word 0. For direct mapping (PS=31), it is simply the high bits of the System Virtual Address to combine with the segment offset, and bits size-1..0 must be zero. For paging, it is the address of the first-level page table, and bits PS+11..0 must be zero.
For direct mapping, the translation works as follows. For segments ≤ 2^48 bytes, the offset is simply bits 47..0 of the local virtual address, and so the first check is that bits 47..size are zero, or equivalently that lvaddr47..0 < 2^size. For segments > 2^48 bytes, the offset extends into the segment number field, and no checking need be done during mapping (such sizes are however used during checking of address arithmetic), but bits 60..size must be cleared before ORing. The second check is that bits size−1..0 of the mapping are zero. The supervisor is responsible for providing the appropriate values in the Segment Descriptor Entries for each portion of segments > 2^48 bytes. Thus paging does not need to handle segments larger than 2^48 bytes (the SDT for such segments is in effect the first level of the page table).
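The direct-mapping checks for segments ≤ 2^48 bytes can be sketched as (an illustrative sketch; the function and parameter names are mine, not part of the spec):

```python
def direct_map(lvaddr, sde_base, size):
    """Direct-mapped translation for a segment of 2**size bytes (size <= 48).

    sde_base is the system virtual address from SDE word 1; the segment
    mapping must be aligned to the segment size.
    """
    offset = lvaddr & ((1 << 48) - 1)   # offset is bits 47..0 of the lvaddr
    if offset >> size:                  # first check: bits 47..size are zero
        raise MemoryError("segment offset out of bounds")
    # Second check: bits size-1..0 of the mapping are zero.
    assert sde_base & ((1 << size) - 1) == 0, "misaligned segment mapping"
    return sde_base | offset

# A 1 MiB segment (size=20) mapped at system virtual address 0x4_0000_0000:
# offset 0x1234 translates to 0x4_0000_1234.
```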
When paging is used, the page tables can be one to four levels deep, with each level after the first using the same page size. The first level uses some fraction of the specified page size, depending on the segment size. The hypervisor and supervisors are free to allocate only as much memory as needed for the first-level page table. The following diagrams show how the local virtual address is used to index the levels of the page table for several page and segment sizes. In the diagrams, the 13‑bit segment number is split into a 3‑bit segment group (SG) number (used to pick the SDTP register) and the 10‑bit index (SEG) within that group.
4 KiB pages, 2^48-byte segment (four levels):

| 60..58 | 57..48 | 47..39 | 38..30 | 29..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|---|
| SG | SEG | V1 | V2 | V3 | V4 | offset |
| 3 | 10 | 9 | 9 | 9 | 9 | 12 |
4 KiB pages, 2^30-byte segment (two levels):

| 60..58 | 57..48 | 47..30 | 29..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|
| SG | SEG | 0 | V1 | V2 | offset |
| 3 | 10 | 18 | 9 | 9 | 12 |
16 KiB pages, 2^48-byte segment (four levels):

| 60..58 | 57..48 | 47 | 46..36 | 35..25 | 24..14 | 13..0 |
|---|---|---|---|---|---|---|
| SG | SEG | V1 | V2 | V3 | V4 | offset |
| 3 | 10 | 1 | 11 | 11 | 11 | 14 |
1 MiB pages, 2^48-byte segment (two levels):

| 60..58 | 57..48 | 47..37 | 36..20 | 19..0 |
|---|---|---|---|---|
| SG | SEG | V1 | V2 | offset |
| 3 | 10 | 11 | 17 | 20 |
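The index widths in the diagrams above follow from the page size: a page of 2^(PS+12) bytes holds 2^(PS+9) 8‑byte PTEs, so each level after the first consumes PS+9 index bits, and the first level covers whatever the segment size leaves over. A small illustrative helper (not part of the spec) makes the derivation checkable:

```python
def walk_indices(seg_size_log2, page_size_log2, levels):
    """Split a segment offset into per-level page-table index widths.

    Returns (first_level_bits, other_level_bits, offset_bits); each
    level after the first indexes a full page of 8-byte PTEs.
    """
    offset_bits = page_size_log2
    index_bits = page_size_log2 - 3          # 8-byte PTEs per page
    first = seg_size_log2 - offset_bits - (levels - 1) * index_bits
    assert first > 0, "too many levels for this segment size"
    return first, index_bits, offset_bits

# 4 KiB pages, 2**48-byte segment, 4 levels -> (9, 9, 12), matching the
# first diagram; 16 KiB pages, 4 levels -> (1, 11, 14), matching the third.
```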
A segment page table consists of multiple levels, each level an array of 72‑bit words with integer tags, in the following format:

| 71..64 | 63..12 | 11..8 | 7..6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| 240 | svaddress63..12 | SW | G | D | A | X | W | R | V |
| 8 | 52 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
Segments are meant as the unit of access control, but including Read, Write, and Execute permissions in the PTE might make ports of less aware operating systems easier.
Field(s) | Width | Description
---|---|---
V | 1 | Valid: 0 ⇒ invalid, bits 63..1 available for software; 1 ⇒ valid, bits 63..1 as described below
R | 1 | Read permission
W | 1 | Write permission
X | 1 | Execute permission
A | 1 | Accessed: 0 ⇒ trap on any access (software sets A to continue); 1 ⇒ access allowed
D | 1 | Dirty: 0 ⇒ trap on any write (software sets D to continue); 1 ⇒ writes allowed
G | 2 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this page traps, and software lowers the G field. This feature is turned off by setting G to 3.
SW | 4 | For software use
svaddress63..12 | 52 | For the last level of the page table, this is the translation; for earlier levels, this is the pointer to the next level
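As an illustration of the PTE layout above, here is a sketch of unpacking the data bits of a PTE (field names follow the table; the dict representation and helper name are mine, not part of the spec):

```python
def decode_pte(word):
    """Unpack the 64 data bits of a PTE (the 240 tag is checked separately)."""
    return {
        "V":   word & 1,
        "R":  (word >> 1) & 1,
        "W":  (word >> 2) & 1,
        "X":  (word >> 3) & 1,
        "A":  (word >> 4) & 1,
        "D":  (word >> 5) & 1,
        "G":  (word >> 6) & 0x3,
        "SW": (word >> 8) & 0xF,
        "svaddress": word & ~0xFFF,   # bits 63..12: translation or next level
    }

# Valid, readable, accessed, dirty page at svaddress 0xABCDE000, GC checks off:
pte = decode_pte((0xABCDE << 12) | (3 << 6) | 0b110011)
```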
After 61‑bit Local Virtual Addresses (lvaddrs) are mapped to 64‑bit System Virtual Addresses (svaddrs), these svaddrs are mapped to 64‑bit System Interconnect Addresses (siaddrs). This mapping is similar, but not identical, to the mapping above, as it starts with a 14‑bit region number rather than one of eight 13‑bit segment numbers. There is one such mapping for the entire system, set by the hypervisor, using a Region Descriptor Table (RDT) at a fixed system address. The RDT may be hardwired, read-only, or read/write by the hypervisor. For the maximum 16,384 regions, with 16 bytes per RDT entry, the maximum RDT size is 256 KiB. A system configuration parameter allows the size of the RDT to be reduced when the full number of regions is not required (which is likely).
The format of the Region Descriptor Entries is a simplified version of Segment Descriptor Entries as shown below.
| 71..64 | 63..26 | 25..14 | 13 | 12 | 11 | 10..6 | 5..0 |
|---|---|---|---|---|---|---|---|
| 240 | 0 | NDA | C | W | R | PS | size |
| 8 | 38 | 12 | 1 | 1 | 1 | 5 | 6 |
| 71..64 | 63..4 | 3..0 |
|---|---|---|
| 240 | System Interconnect Address | 0 |
| 8 | 60 | 4 |
The format of a region page table is multiple levels, each level consisting of 72‑bit words with integer tags in the same format as PTEs for local virtual to system virtual mapping, except there are no X or G fields.
Additional fields in the RDE may be useful. A bit indicating whether memory is tagged or not may be useful (tag-aware ports would provide a 240 tag on reads and check that the tag is 240 on writes). Another field might indicate whether encryption is used for the region, and if so, which of the port’s keys to use.
Reading Segment Descriptor Entries (SDEs) from the Segment Descriptor Table (SDT) and Region Descriptor Entries (RDEs) from the Region Descriptor Table (RDT) would typically be done through the L2 Data Cache. Since the L2 Data Cache is coherent with this and other processors in the system, the L2 Data Cache might note that the TLB contains entries from a line and send an invalidate to the TLB when that L2 line is invalidated. This might avoid the need for some TLB flushes. However, it requires the L2 to store the TLB location, which might require 8 bits per L2 tag. It is unclear whether this is worthwhile.
Ports into the system interconnect (Initiators) are limited in which regions they are permitted to access. The exact mechanism is TBD.
One possibility is that each Initiator is programmed by the hypervisor with two non-discretionary access control (NDAC) sets. One is for the Initiator’s TLB accesses, and the other is for accesses made by agents that the Initiator services. Non-discretionary access control is also stored as part of the Region Descriptors and cached in the Initiator’s TLB. The Initiator tests each access and rejects those that fail. Read access requires RegionCategories ⊆ InitiatorCategories and Write access requires RegionCategories = InitiatorCategories. For example, the Region Descriptor Table and the page tables those reference might have a Hypervisor bit that would prevent reads and writes from anything but Initiator TLBs. Processors would have non-discretionary access control sets per-ring. This would allow the same system to support multiple classification levels, e.g. Orange Book Secret and Top-Secret, with Top-Secret peripherals able to read both Secret and Top-Secret memory, but Secret peripherals denied access to Top-Secret memory.
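The category-set tests described above can be sketched as follows (modeling NDAC sets as sets of labels; purely illustrative, not part of the spec):

```python
def ndac_allows(initiator, region, write):
    """Non-discretionary access control check as described above.

    Read access requires RegionCategories to be a subset of
    InitiatorCategories; write access requires equality.
    """
    if write:
        return region == initiator
    return region <= initiator   # <= is the subset test on frozensets

top_secret = frozenset({"secret", "top-secret"})
secret = frozenset({"secret"})
# A Top-Secret initiator may read Secret memory, but a Secret initiator
# may not read Top-Secret memory, and writes require an exact match.
```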
Encryption might also be used to protect multiple levels of data in a system. For example, if Secret and Top-Secret data in memory are encrypted with different keys, and Secret Initiators are only programmed with that encryption key, then reading Top-Secret memory will result in garbage being read and writing Top-Secret data from a peripheral to Secret memory will result in that data being garbage to a processor or another peripheral with only the Secret key.
Because encryption merely renders data unintelligible rather than denying the access, encryption-only protection is more difficult to debug. It may be desirable to employ both NDAC sets and encryption.
An optional system feature of RDEs is to specify that the contents of the memory of the region should be protected with data at rest encryption. A separate table (perhaps in a secure enclave) would give the symmetric encryption key for encrypting and decrypting data transferred to and from the region and the system interconnect address would be used as the tweak. An obvious possibility is a 144‑bit block size cipher (e.g. a variant of AES based upon 9‑bit S‑boxes) used in Galois/Counter Mode (GCM), resulting in a 144‑bit authentication code, which would be stored in memory with the block. For SecureRISC0, with a cache line size of 8 words of 72 bits, this results in 576‑bit entities for data at rest protection, which becomes 720 bits with the authentication code, or eight 90‑bit words, which would be ECC protected with 8 check bits, producing a 98‑bit memory word. This would be an unusual width for standard DRAMs, and 9 ECC bits per 180 would also be unusual. Instead consider the 576 bits to be 9 words of 64 bits, use a more standard 128‑bit block cipher (e.g. standard AES) nine times, add the 128‑bit authentication, resulting in 704 bits, or eight words of 88 bits. Adding 8 bits of ECC results in 96 bits per memory word, which might use three 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 GCM xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would be much less time than the ECC check. Regenerating ECC for the decrypted data for writing into the L2 cache can be done by also precomputing the 64 bits to xor with the 8 ECC codes. Only if an ECC error is detected and corrected is it necessary to recompute the ECC before writing into the L2 cache. Writes would incur the GCM computation latency (primarily nine AES computations). 
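The bit budgets in the paragraph above can be checked mechanically (a quick arithmetic restatement of the two layouts described, not new design):

```python
# Cache line: 8 tagged words of 72 bits.
line_bits = 8 * 72
assert line_bits == 576

# Layout 1: 144-bit block cipher (9-bit S-box AES variant) + 144-bit GCM tag.
protected = line_bits + 144            # 720 bits per protected line
word1 = protected // 8                 # spread over eight memory words
assert (word1, word1 + 8) == (90, 98)  # 90 bits + 8 ECC bits = unusual 98

# Layout 2: treat the 576 bits as 9 x 64, standard AES + 128-bit GCM tag.
protected = 9 * 64 + 128               # 704 bits per protected line
word2 = protected // 8                 # spread over eight memory words
assert word2 + 8 == 96                 # 88 bits + 8 ECC bits = 96 per word
```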
Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to support non-block writes (it being impossible to update the authentication code without doing a read-modify-write).
| Structure | Description |
|---|---|
| Basic Block Descriptor Fetch | |
| Predicted PC | 64‑bit lvaddr and ring |
| Predicted COUNT | 64‑bit integer |
| Predicted CSP | 64‑bit lvaddr and ring |
| L1 BB Descriptor TLB | 32 entry, 8‑way set associative, mapping lvaddr61..12 to siaddr63..12 in parallel with BB Descriptor Cache, filled from L2 Descriptor/Instruction TLB |
| L2 BB Descriptor TLB | 256 entry, 8‑way set associative, filled from L2 Data Cache |
| BB Descriptor Cache | 32 KiB (4096 descriptors), 8‑way set associative, 64‑byte line size, 8‑byte read, 64‑byte write, lvaddr11..3 index, ?siaddr35..12 tag?, 1.5 cycles latency, 2 cycles to predicted PC, filled from L2 Descriptor/Instruction Cache on miss and by prefetch |
| Next Descriptor Index Predictor | 32×10+12, direct mapped, lvaddr7..3 index, lvaddr19..8 tag, 1 cycle to predicted BB Descriptor Cache index, most recent flow change hits from BB Descriptor Cache |
| Return Address Prediction | 64‑entry (512 B) |
| Branch Predictor | ~16 KiB BATAGE |
| Indirect Jump/Call Predictor | ~16 KiB ITTAGE? |
| BB Fetch Output | 8‑entry BB Descriptor Queue of PC, BB type, fetch count, fetch siaddr63..2, instruction start mask, prediction to check |
| Instruction Fetch | |
| L1 Instruction Cache | 64 KiB, 4‑way set associative, 64‑byte line, read, write, siaddr13..4 index, siaddr63..14 tag, 2‑cycle latency, used 0–2 times per basic block descriptor (0 fetches required when the previous 64 B fetch covers the current one), so 0 or 2–3 cycles for the entire BB instruction fetch, filled from L2 Descriptor/Instruction Cache on miss and prefetch; experiment with prefetch on BB descriptor fill |
| L2 Fetch | |
| L2 Combined Descriptor/Instruction Cache | 512 KiB, 8‑way set associative, 64‑byte line, read, write, siaddr15..6 index, siaddr63..16 tag, filled from system interconnect or L3 on miss and prefetch, evictions to L3 |
| Instruction Fetch Output | 32‑entry Instruction Queue of 50‑bit decoded instructions (16‑bit and 32‑bit instructions expanded) |
| AR Execution Unit | |
| PC, CSP, COUNT | Committed values |
| Register renaming for ARs | 16×6 4‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical AR numbers and assigning d from the AR free list |
| Register renaming for XRs | 16×6 8‑read, 4‑write register file mapping 4‑bit a, b fields to physical XR numbers and assigning d from the XR free list |
| Register renaming for BRs | 16×6 6‑read, 2‑write register file mapping 4‑bit a, b, c fields to physical BR numbers and assigning d from the BR free list |
| Register renaming for SRs | 16×6 8‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical SR numbers and assigning d from the SR free list |
| (VRs are not renamed) | |
| AR physical register file | 64×144 (+ parity) 6‑read, 4‑write |
| XR physical register file | 64×72 (+ parity) 6‑read, 4‑write |
| L1 Data TLB | 64 entry, 8‑way set associative, mapping lvaddr to siaddr, filled from L2 Data TLB |
| L2 Data TLB | 256 entry, 8‑way set associative, filled from L2 Data Cache |
| L1 Data Cache | 32 KiB, 4‑way set associative, 64‑byte line, 16‑byte read, 64‑byte write, lvaddr index, siaddr tag, write-thru, filled from L2 Data Cache on miss or prefetch |
| Return Address Stack Cache | 64‑entry (512 B), 64‑byte line size, no tags, fill and writeback to L2 Data Cache, subset of and coherent with L2 Data Cache |
| L2 Data Cache | 512 KiB, 8‑way set associative, 64‑byte line, read, write, siaddr15..6 index, siaddr63..16 tag, write-back, filled from system interconnect or L3 on miss or prefetch, eviction to L3 |
| AR Engine Output | 64‑entry BR/SR/VR operation queue |
| SR/VR Execution Unit (tends to run about an L2 Data Cache latency behind the AR Execution Unit) | |
| BR physical register file | 64×1 6‑read, 2‑write |
| SR physical register file | 64×72 (+ parity) 8‑read, 4‑write |
| VR register file | 16×72×128 (+ parity) 4‑read, 2‑write |
| Combined Fetch/Data | |
| System virtual address TLB | 128 entry, 8‑way set associative, mapping system virtual addresses to system interconnect addresses (maintained by the hypervisor) |
| L3 Eviction Cache | 8 MiB, 8‑way set associative, 64‑byte line size, non-inclusive, serving the L2 Instruction and L2 Data caches, plus an 8‑way set associative directory for sub-caches, filled from evictions from the L2 Instruction and Data caches |
Tag | Use |
---|---|
0 | Null pointers |
1..128 | Pointer to 1..128 words |
129..135 | Pointer to 1..7 bytes |
136 | Pointer to N words, with N stored at pointer − 8, and −N stored at pointer + N×8 |
137 | Unsized Pointer for C++ |
138..191 | Reserved |
192 | Pointer to Basic Block Descriptor |
193..199 | Reserved |
200 | CHERI Capability word 0 |
201..223 | Reserved |
224 | Lisp CONS |
225 | Lisp Function |
226 | Lisp Symbol |
227 | Lisp/Julia Structure |
228..229 | Reserved |
230 | Lisp Array |
231 | Lisp Vector |
232 | Lisp String |
233 | Lisp Bit-vector |
234 | Lisp Ratio, Julia Rational |
235 | Lisp/Julia Complex |
236 | Lisp Bigfloat |
237 | Lisp Bignum |
238 | 128‑bit integer |
239 | 128‑bit unsigned integer |
240 | 64‑bit integer |
241 | 64‑bit unsigned integer |
242 | Small integer types |
243 | Reserved |
244 | Double-precision floating-point |
245 | 8, 16, and 32‑bit floating-point |
246..251 | Reserved |
252 | CHERI capability word 1. Bits 143..136 of AR doubleword store (used for save/restore and CHERI capabilities)
253 | Basic Block Descriptor |
254 | Size header/trailer words |
255 | Trap on load or store |
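As an illustration of the size-encoding tags (0, 1..128, 129..135, and 136) in the table above, here is a hedged sketch of recovering an object's extent from its pointer tag (the helper names and memory-access callback are mine, not part of the spec):

```python
def pointer_extent_bytes(tag, load_word_at=None, ptr=None):
    """Object extent in bytes implied by a pointer tag, per the table above.

    load_word_at and ptr are only needed for tag 136, where the word
    count N is stored in memory at ptr - 8.
    """
    if tag == 0:
        return 0                           # null pointer
    if 1 <= tag <= 128:
        return tag * 8                     # pointer to 1..128 words
    if 129 <= tag <= 135:
        return tag - 128                   # pointer to 1..7 bytes
    if tag == 136:
        return load_word_at(ptr - 8) * 8   # N words, N stored before object
    raise ValueError("tag does not encode a size")

# Tag 128 -> 128 words (1024 bytes); tag 133 -> 5 bytes.
```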
Earl Killian <webmaster at securerisc.org> |
SecureRISC0/index.html 2022-08-16 |