This is an older set of ideas that I later evolved into SecureRISC. For the moment I am calling this older ISA SecureRISC0. It will eventually go away; at the moment I am keeping it for my reference purposes only. If you got here via a search engine, I suggest you follow the link above to look at the current stuff instead.
This document is organized as successive expositions at increasing levels of detail, to give the reader an idea of the motivations and high-level differences from conventional processor architectures, eventually getting down to the detailed definitions that direct SecureRISC processor execution. So if the introductory material seems a little vague, that is because it attempts to sketch an overall context into which the details are later fit.
This document was created to develop old ideas and notes of mine. It is not a complete Instruction Set Architecture (ISA), but only the things I have had time to consider and work on. In addition, while learning about CHERI, a new option I had not previously considered occurred to me, and I created a variant called SecureRISC to develop those ideas in parallel, but that document is only barely started. For the time being, neither SecureRISC0 nor SecureRISC is anything but a point of discussion.
The ISA is mostly just ideas at this point. The opcode assignments and instruction specifications are no more than hints, and the virtual memory architecture needs work.
SecureRISC0 began as an exploration of what a security-conscious ISA might look like. Should it or SecureRISC someday turn into something more than an exploration, my intent would be to make it an Open Source ISA, along the lines of RISC‑V.
There is no software (compiler, operating system, etc.) for SecureRISC0. This is a paper-only spec at this point in time.
SecureRISC0 is my attempt to explore my thoughts on a security-conscious Instruction Set Architecture (ISA) appropriate for server class systems, but which with modern process technology (e.g. 5 nm) could even be used for IoT computing, given that the die area for a single such processor is a small fraction of one mm². I start with the assumption that the processor hardware should enforce bounds checking and that the virtual memory system should use older, more sophisticated security measures, such as those found in Multics, including rings, segmentation, and discretionary and non-discretionary access control. I also propose a new block structured instruction set that allows for better Control Flow Integrity (CFI) and performance.
I feel a comment about Multics is appropriate here. There seems to be an impression among many in the computer architecture world that many Multics features are complex. They are actually simple and general. Computer architecture from the 1980s to present is often an oversimplification of Multics. For example, segmentation in Multics served primarily to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) puts pressure to keep the number of bits devoted to access control minimal, when security considerations might suggest a more robust set. As another example, RISC‑V has two rings (User and Supervisor), with a single bit in PTEs (the U bit) serving as a ring bracket. Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. It is true that rings were not well utilized on Multics, but we now have more uses for multiple rings.
The goals for SecureRISC0 in order of priority are:
Non-goals for SecureRISC0 include (this list will probably grow):
Security can mean many things. One of the most important is preventing unassisted infiltration (e.g. through exploiting buffer overflows, use-after-free errors, and other programming mistakes). Another is preventing unintentionally-assisted infiltration (e.g. phishing attacks installing trojans), which may be accomplished through non-discretionary access control. SecureRISC0 is not a comprehensive attempt at security, but addresses the aspects that I think can be improved.
While I expect that non-discretionary access control is critical to computer security, at this point there is relatively little in SecureRISC0’s architecture that enforces this (it is primarily left to software). However, I am still looking for opportunities in this area.
Security, garbage collection, and dynamic typing may appear to be orthogonal, but I see them as synergistic. SecureRISC0 attempts to minimize the impact of programming mistakes in several ways, such as making bounds checking somewhat automatic and making compiler-generated checking more efficient. To address memory allocation error detection, however, other techniques are necessary. One possibility is garbage collection (GC), which eliminates these errors, but GC needs to be efficient for this application, hence the goal synergy. Another way to detect some allocation errors is tagging memory so that use after free is detected (unfortunately use-after-reallocation may not be detected with this mechanism). SecureRISC0 targets these goals with what will likely be its most controversial aspect: tags on words in memory and registers. The Basic Block descriptors may be more unusual, but I think the reader will come to appreciate them with familiarity (especially given the Control Flow Integrity advantages as a security feature); the reader may in the end not find memory tags convincing because of the non-standard word size that results, but I do not see an alternative. Tags simultaneously provide a mechanism for bounding pointers, support use-after-free detection and more efficient Garbage Collection (the best solution to allocation errors), and also happen to support dynamically typed languages.
SecureRISC0’s pointer bounding is, however, not as general as I would like; it is suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so the SecureRISC0 variants are better suited to languages such as Rust, Swift, Julia, Python, or Lisp. Running some C++ code would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. I have reserved a tag for C++ pointers, but using these would represent a less secure mode of operation. The supervisor would need to enable C++ pointers on a per-process basis; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity.
SecureRISC0 does have support for CHERI capabilities, which has been demonstrated to be fairly compatible with C++, at the cost of making pointers two words. There is much more about CHERI in subsequent sections.
The original motivation for block-structured ISAs was Instruction Level
Parallelism (ILP) studies that I did back in 1997 at SGI that showed
that instruction fetch was the limiting factor in ILP. This was before
modern branch prediction, e.g. TAGE, so that result may no longer be
true. The idea was that instruction fetch is like linked list
processing, with parsing at each list node to find the next link. I
wanted to replace linked lists with vectors, but couldn’t figure
out how, and settled for reducing the parsing at each list node. I
still feel that this is worthwhile, but the exact tradeoffs might
require updating older work in this area. The best validation of this
dates from 2007,
when Professor Christoforos Kozyrakis
convinced his PhD student
Dr. Ahmad Zmily
to look at this approach in a PhD thesis. In the introduction of
Block-Aware Instruction Set Architecture
Dr. Zmily wrote,
We demonstrate that the new architecture improves upon conventional
superscalar designs by 20% in performance and 16% in energy.
Such an advantage is not by itself enough to foist a new ISA upon the
world, but it encourages me to think that it provides impetus for
using such a base when creating a new ISA for other purposes, such as
security.
Prior to starting SecureRISC0, my previous experience was with many ISAs and operating systems. Long after starting my block-structured ISA thoughts, I became involved in the RISC‑V ISA project. RISC‑V is in many ways a cleaned-up version of the MIPS ISA (e.g. minus load and branch delay slots) and it seems likely to become the next major ISA after x86 and ARM. Being Open Source, RISC‑V has easily accessible documentation. As such I have used it for comparisons in the current description of SecureRISC0 and modified some of my virtual memory model to be slightly more RISC‑V compatible (e.g. choosing the number of segment bits to be compatible with RISC‑V Sv48). However, most aspects of the SecureRISC0 ISA predate my knowledge of RISC‑V and were not influenced by it, except that I found that RISC‑V’s Vector ISA was more developed than my own thoughts (which were most influenced by the Cray-1, which supported only 64‑bit precision).
In 2022 I encountered the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) research effort. I found their work impressive, but I had concerns about the practicality of some aspects. Despite my concerns, I thought that SecureRISC0 might be a good platform for CHERI, so I have extended SecureRISC0 to outline how it might support CHERI capabilities as an exploration. The SecureRISC variant incorporates a new sized pointer format based on ideas from CHERI. This sized pointer is not as capable as a CHERI pointer, but it is 64 bits rather than 128 bits, which has an obvious size advantage. There is a more detailed discussion of CHERI and SecureRISC0 below.
Some things remain unchanged from other RISCs. Memory is byte addressed. Like other RISC ISAs, SecureRISC0 is mostly based upon loads and stores for memory access. Integers and floating-point values have 8, 16, 32, or 64‑bit precision. Floating-point would be IEEE-754-2019 compatible. The Vector ISA will probably be similar to the RISC‑V Vector ISA, but might use the 48‑bit instruction format to do more in the instruction word and less with vset. Also, there are four explicit vector mask registers, rather than using v0.
Readers will have to decide for themselves whether the proposed virtual memory is conventional because it is somewhat similar to Multics, or unconventional, because it is different from RISC ISAs of the last forty years. A similar comment could be made concerning the register architecture, since it echos an ISA from 1976, but is somewhat different from RISCs since the 1980s.
Much more in SecureRISC0 is unconventional. To prepare the reader to put aside certain expectations, we list some of these things here at a high level, with details in later sections.
a[i] loads or stores to that location only after checking that i is within the bounds specified in the array pointer. C++ *p++-style programming is less amenable to SecureRISC0 bounds checking and is not the intended target of this ISA.
for i ← a to b (where the loop iteration count is b − a + 1) and for i ← a to b step -1 (where the loop iteration count is a − b + 1). The loop may be exited early with a conditional branch; only the loop back is predicted with the hint.
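The loop forms above can be modeled with a short sketch. This is my own illustrative Python model, not part of the SecureRISC0 definition: it assumes the count is computed before the loop (as MOVCOUNTA would) and decremented once per iteration (as LOOP would).

```python
def iteration_count(a, b, step):
    """Iteration count for the two loop forms stated in the text."""
    if step == 1:
        return b - a + 1        # for i ← a to b
    elif step == -1:
        return a - b + 1        # for i ← a to b step -1
    raise ValueError("only step ±1 modeled here")

def run_loop(a, b, step):
    """Model: COUNT set before entry, LOOP decrements toward zero."""
    count = iteration_count(a, b, step)  # MOVCOUNTA before the loop
    executed = 0
    while True:
        executed += 1                    # loop body runs once per iteration
        count -= 1                       # LOOP decrements COUNT
        if count == 0:
            break                        # loop back not taken: fall through
    return executed
```

Because the count is known before the loop begins, the loop-back direction is perfectly predictable from the shadow count, which is the point of the hint.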
The Basic Block (BB) descriptor aspect listed above is perhaps the most unfamiliar. To help motivate it for the reader, below are some of the rationale and advantages of this approach.
Contemporary processors have various structures that are created and updated during program execution to improve performance, such as Caches, TLBs, Branch Target predictors (BTB), Return Address predictors (RAS), Conditional Branch predictors, Indirect Jump and Call predictors, prefetchers, and so on. In SecureRISC0 one of these is moved into the ISA for performance and security. In particular the BTB becomes a Basic Block Descriptor Cache (BBDC). The BBDC caches lines of Basic Block Descriptors that are generated by the compiler, in a separate section from the instructions. I have also sought to make the Return Address predictor more cache-like, and to build in some additional ISA support for loop prediction.
fall through to subsequent descriptors, but each has a pointer to the instructions to fetch, and so the instruction blocks of a page could simply be sorted by frequency, placing the hottest first and the coldest last, or some similar arrangement, all without introducing new instructions or changing anything other than the pointers in the descriptors.
I started with the assumption that pointers are a single word, expanded based on the 8-bit tag to a base and size when loaded into the doubleword (144‑bit) Address Registers (ARs). Indexing using such a pointer checks the index value against the size. This supports programs oriented toward a[i] pointer usage, but not C++ *p++ pointer arithmetic.
In contrast, the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) Project started with the assumption that capability pointers are four words (including lower and upper bounds, the pointer itself, and permissions and object type), and invented a compression technique to get them down to two words. SecureRISC0 can support CHERI by using its 128‑bit AR load and store instructions to transfer capabilities to and from the 144‑bit ARs, and is therefore able to accommodate either singleword or doubleword pointers. Support for the CHERI bottom and top decoding, its permissions, and its additional instructions would be required. The CHERI tag bit is replaced with two SecureRISC0 reserved tag values (one tag value in word 0, another in word 1). I would expect languages such as Julia and Lisp to prefer singleword pointers, so supporting both singleword and doubleword pointers allows both to exist on the same processor depending on the instructions generated by the compiler.
Unlike CHERI, SecureRISC0 pointers encode only a size, not bottom and top values. As a result both SecureRISC0 and SecureRISC are more suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so the SecureRISC0 variants are better suited to languages other than C++, primarily ones that emphasize array indexing over pointer arithmetic. My expectation is that running some C++ code would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. I suspect it would be a better target for Rust, Swift, or Julia. I have reserved a tag for C++ pointers, but using these would represent a less secure mode of operation. The supervisor would need to enable C++ pointers on a per-process basis; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity.
Tagged memory words are separable from other aspects of SecureRISC0, such as the Multics aspects and the Basic Block descriptor aspects. One could imagine a version of SecureRISC0 without the tags and a 64‑bit word (72 bits with ECC in memory). Even in such a reduced ISA—call it SemiSecureRISC—I would keep the AR/XR/SR/VR model. SemiSecureRISC is still interesting for its performance and security advantages, but I do not plan to explore it at this time. There is also the possibility of combining SemiSecureRISC with CHERI and its 1‑bit tag, since the CHERI project has done a lot of important software work. Call such an ISA BlockCHERI. I suspect the CHERI researchers would say that the only advantage in BlockCHERI would be the performance advantage of the Block ISA and the AR/XR/SR separation, with the ARs specialized for CHERI capabilities, and the XRs/SRs for non-capability data. My primary thought on BlockCHERI is that, compared to its 65‑bit memory word (73 bits with ECC), the 72‑bit word (80 bits with ECC) provides 7 extra bits that may be put to good use.
One could imagine variants of SecureRISC0 that have only some of its features:
Name | Block ISA | Segmentation | Rings | Tags | CHERI | Word | Pointer
---|---|---|---|---|---|---|---
SecureRISC0 | ✔ | ✔ | ✔ | ✔ | ✔ | 72 | 72/144
SemiSecureRISC | ✔ | ✔ | ✔ | | | 64 | 64
BlockRISC | ✔ | | | | | 64 | 64
BlockCHERI | ✔ | ? | ? | | ✔ | 65 | 130
As I indicated earlier, I don’t think that BlockRISC is sufficient in itself to justify a new ISA. I am concentrating on the full package.
I need to think more carefully about I/O in a SecureRISC0 system. Certainly some I/O will be done in terms of byte streams transferred via DMA to and from main memory (e.g. DRAM). Such I/O, if directed to tagged main memory, writes the bytes with an integer tag. Similarly, if processors use uncached writes of 8, 16, 32, or 64 bits (as opposed to 8‑word blocks) to tagged memory, the memory tag must be changed to integer. Tag-aware I/O of 8‑word units exists and may be used for paging and so forth. It may be useful to provide a general facility for reading tagged memory, including the tags, as a stream of 8‑bit bytes with cryptographic signing, and for writing such a stream back with signature checking.
Ports onto the system interconnect fabric will have to have rights and permissions assigned by the hypervisor, and perhaps hypervisor guests. This needs to be worked out.
Being able to support user-mode I/O would be desirable, but it seems difficult to make this work, because the user ring code would be sending its own local virtual addresses to the I/O device for DMA. The I/O devices would then have to translate user addresses to system interconnect addresses via two-level page tables, and user mode would have to tell the I/O device which page table the supervisor assigned it, which it does not know. At the moment, I have left this unaddressed.
SecureRISC0 is my first set of thoughts on this subject. I have since been exploring a variation called SecureRISC. However, that document is barely different from this one at this point. The primary differences are:
SecureRISC0 and SecureRISC encode the size in different ways, and both have tradeoffs:
The SecureRISC pointer format under consideration would change to the following form:
71..67 | 66..64 | 63..61 | 60..48 | 47..0
---|---|---|---|---
SS (5) | ring (3) | SG (3) | segment (13) | size (21−SS) ∥ 0 (2) ∥ byte address (25+SS)
where the boundary between the byte address and the size, and the scaling of the size value, is determined by the SS field for values 0 to 20 (tags 0 to 167). Value 21 is Reserved. Value 22 is tentatively assigned to pointers to regions with header/trailer size words. Value 23 (tags 184 to 191) of SS is used for pointers with no size field (the only size check comes from the segment descriptor size field). Values 24 to 29 (tags 192 to 239) of SS and the ring field are used for dynamic type tagging with a pointer in bits 63..0 or a pointer with a size implied by the tag (e.g. 2 words for the CONS tag). Values 30 and 31 are used for dynamic type tagging with non-pointer data in bits 63..0. The byte address occupies bits SS+24..0, and the size occupies bits 47..SS+26 with value ptr[47..SS+26] ∥ 0^(SS×2+3). For example:
SS | Address bits | Address width | Size bits | Size width | Granularity (bytes)
---|---|---|---|---|---
0 | 24..0 | 25 | 47..26 | 22 | 8
1 | 25..0 | 26 | 47..27 | 21 | 32
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮
19 | 43..0 | 44 | 47..45 | 3 | 2^41
20 | 44..0 | 45 | 47..46 | 2 | 2^43
21 | Reserved | | | |
22 | Reserved | | | |
23 | 47..0 | 48 | n.a. | n.a. | n.a.
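The SS-dependent split can be made concrete with a small sketch. This is my own Python model of the formula stated above (byte address in bits SS+24..0, size field in bits 47..SS+26, scaled by 2^(SS×2+3)); the function name is mine, not part of the spec.

```python
def decode_address_and_size(ptr_low48, ss):
    """Split bits 47..0 of a sized pointer into (byte address, size in bytes)."""
    assert 0 <= ss <= 20, "SS values 21..31 have other meanings"
    addr = ptr_low48 & ((1 << (25 + ss)) - 1)   # byte address: bits SS+24..0
    size_field = ptr_low48 >> (ss + 26)         # size field: bits 47..SS+26
    size = size_field << (ss * 2 + 3)           # append SS*2+3 zero bits
    return addr, size
```

At SS = 0 the size granularity is 8 bytes and at SS = 1 it is 32 bytes, matching the table above; larger SS values trade size precision for address range.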
Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). While not a documentation convention, I might as well mention up front that SecureRISC0 is similarly Little Endian in its byte addressing.
domains where permissions were specified without nesting. This is straightforward, until the procedure for evaluating permissions of reference parameters using the privilege of the calling domain is attempted. SecureRISC does not attempt to generalize rings to domains due to this complexity.
What | R1,R2,R3 | seg RWX | R b | W b | X b | G b | Ring 0 | Ring 1 | Ring 2 | Ring 3 | Rings 4 to 6
---|---|---|---|---|---|---|---|---|---|---|---
User code | 7,1,1 | R-X | [1,6] | - | [1,6] | - | ---- | R-X- | R-X- | R-X- | R-X- |
User stack or heap | 1,1,7 | RW- | [1,6] | [1,6] | - | - | ---- | RW-- | RW-- | RW-- | RW-- |
User return stack | 3,1,7 | RW- | [1,6] | [3,6] | - | - | ---- | R--- | R--- | RW-- | RW-- |
User read-only file | 7,1,7 | R-- | [1,6] | - | - | - | ---- | R--- | R--- | R--- | R--- |
Supervisor driver code | 7,2,2 | R-X | [2,6] | - | [2,6] | - | ---- | ---- | R-X- | R-X- | R-X- |
Supervisor driver data | 2,2,7 | RW- | [2,6] | [2,6] | - | - | ---- | ---- | RW-- | RW-- | RW-- |
Supervisor code | 7,3,3 | R-X | [3,6] | - | [3,6] | - | ---- | ---- | ---- | R-X- | R-X- |
Supervisor heap or stack | 3,3,7 | RW- | [3,6] | [3,6] | - | - | ---- | ---- | ---- | RW-- | RW-- |
Compiler library | 7,0,0 | R-X | [0,6] | - | [0,6] | - | R-X- | R-X- | R-X- | R-X- | R-X- |
Supervisor gates for user | 7,3,1 | R-X | [3,6] | - | [3,6] | [1,2] | ---- | ---G | ---G | R-X- | R-X- |
Sandboxed JIT code | 1,0,0 | RWX | [0,6] | [1,6] | [0,1] | - | R-X- | RWX- | RW-- | RW-- | RW-- |
Sandboxed JIT stack or heap | 0,0,7 | RW- | [0,6] | [0,6] | - | - | RW-- | RW-- | RW-- | RW-- | RW-- |
Sandboxed JIT return stack | 1,0,7 | RW- | [0,6] | [1,6] | - | - | R--- | RW-- | RW-- | RW-- | RW-- |
User gates for sandbox | 7,1,0 | R-X | [1,6] | - | [1,6] | [0,0] | ---G | R-X- | R-X- | R-X- | R-X- |
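The per-ring permission columns above appear to follow a simple rule: an access is permitted when the current ring number falls within the corresponding bracket. The following Python sketch is my inference from the table rows, not a statement of the actual SecureRISC0 definition; the function name and bracket representation are mine.

```python
def ring_perms(ring, r=None, w=None, x=None, g=None):
    """Return an RWXG string for one ring, given brackets like (1, 6).

    A bracket of None means that kind of access is never permitted.
    """
    inside = lambda b: b is not None and b[0] <= ring <= b[1]
    return "".join(ch if inside(b) else "-"
                   for ch, b in (("R", r), ("W", w), ("X", x), ("G", g)))
```

For example, the "User code" row (R [1,6], X [1,6]) yields `----` in ring 0 and `R-X-` in rings 1 to 6, and the "Supervisor gates for user" row yields `---G` in rings 1 and 2 but `R-X-` in ring 3, matching the table.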
71..64 | 63..0
---|---
tag | data
8 | 64

71..64 | 63..0
---|---
240 | integer
8 | 64

71..64 | 63..0
---|---
244 | float64
8 | 64

71..64 | 63..61 | 60..0
---|---|---
0 | ring | 0
8 | 3 | 61

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
1..128 | ring | word address | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..0
---|---|---
129..135 | ring | byte address
8 | 3 | 61

71..64 | 63..61 | 60..0
---|---|---
137 | ring | byte address
8 | 3 | 61

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
136 | ring | word address | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
254 | 0 | word count | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
254 | 7 | − word count | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..3 | 2..0
---|---|---|---
192 | ring | BB descriptor word address | 0
8 | 3 | 58 | 3

71..64 | 63..61 | 60..0
---|---|---
200 | ring | Local virtual address
8 | 3 | 61

71..64 | 63..0
---|---
252 | CHERI capability bits
8 | 64

71..64 | 63..0
---|---
255 | data
8 | 64
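The common structure of the word formats above (tag in bits 71..64, data in bits 63..0) can be sketched with a few helpers. This is a minimal illustrative model; the helper names are mine, and only the tag values 240 (integer) and 244 (float64) are taken from the formats above.

```python
TAG_INTEGER = 240   # from the integer word format above
TAG_FLOAT64 = 244   # from the float64 word format above

def pack_word(tag, data):
    """Assemble a 72-bit tagged word: tag in bits 71..64, data in 63..0."""
    assert 0 <= tag <= 255 and 0 <= data < (1 << 64)
    return (tag << 64) | data

def tag_of(word):
    """Extract the 8-bit tag from bits 71..64."""
    return word >> 64

def data_of(word):
    """Extract the 64-bit data from bits 63..0."""
    return word & ((1 << 64) - 1)
```

Hardware would of course carry the tag in dedicated storage rather than arithmetic bits, but the bit positions are as shown.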
As noted earlier, it is useful to provide tags for Common Lisp, Python, and Julia types, even when they are simply pointers to fixed-sized memory and could theoretically use tags 1..128. This would consume perhaps 10 more tags, as illustrated in the following table, with the assumption that other types could employ the structure type or something like it (perhaps some of the following could do so as well).
Tag | Lisp | Julia | Data use
---|---|---|---
1..128 | simple-vector? | Tuple? | Pointer to N words
129..135 | no dynamic typing use | |
136 | simple-vector? | Tuple? | Pointer to N words
137..223 | no dynamic typing use | |
224 | CONS | | Pointer to a pair
225 | Function | | Pointer to a pair
226 | Symbol | | Pointer to structure
227 | Structure | Structure? | Pointer to structure
228..229 | no dynamic typing use | |
230 | Array | | Pointer to structure
231 | Vector | | Pointer to structure
232 | String | | Pointer to structure
233 | Bit-vector | | Pointer to structure
234 | Ratio | Rational | Pointer to pair
235 | Complex | Complex | Pointer to pair
236 | Bigfloat | BigFloat | Pointer to structure
237 | Bignum | BigInt | Pointer to structure
238 | | Int128 | Pointer to pair, −2^127..2^127−1
239 | | UInt128 | Pointer to pair, 0..2^128−1
240 | Fixnum | Int64 | −2^63..2^63−1
241 | | UInt64 | 0..2^64−1
242 | Character | Bool, Char, Int8, Int16, Int32, UInt8, UInt16, UInt32 | UTF-32 + modifiers, subtype in upper 32 bits
243 | no dynamic typing use | |
244 | Float | Float64 | IEEE-754 binary64
245 | | Float16, Float32 | subtype in upper 32 bits
246..255 | no dynamic typing use | |
In addition to Lisp types, SecureRISC0 could define tags for other dynamically typed languages, such as Python. Tuples, ranges, and sets might be examples. Other types, such as modules, might use a general structure-like building block rather than individual tags, as suggested for Lisp above.
71..64 | 63..61 | 60..50 | 49..41 | 40..25 | 24..15 | 14..11 | 10 | 9..6 | 5..0
---|---|---|---|---|---|---|---|---|---
253 | hint | targr | targl | start | offset | size | c | next | prev
8 | 3 | 11 | 9 | 16 | 10 | 4 | 1 | 4 | 6
63..61 | 60..58 | 57..48 | 47..0
---|---|---|---
ring | SG | SEG | offset
3 | 3 | 10 | 48
63..50 | 49..0
---|---
region | offset
14 | 50
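Splitting a local virtual address per the ring/SG/SEG/offset layout above is simple bit extraction. A minimal Python sketch (the function name is mine):

```python
def split_local_va(va):
    """Split a 64-bit local virtual address into (ring, SG, SEG, offset).

    Layout per the table above: ring in bits 63..61, SG in 60..58,
    SEG in 57..48, offset in 47..0.
    """
    ring = (va >> 61) & 0x7
    sg = (va >> 58) & 0x7
    seg = (va >> 48) & 0x3FF           # 10-bit segment number
    offset = va & ((1 << 48) - 1)      # 48-bit offset within the segment
    return ring, sg, seg, offset
```

The 48-bit offset matches the choice of segment bits made for RISC‑V Sv48 compatibility mentioned earlier.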
The user process state includes:
Name | Depth | Width | Read ports | Write ports | Description
---|---|---|---|---|---
PC | 1 | 3 + 58 + 5 | | | The Program Counter holds the current ring number, Basic Block descriptor address, and 5‑bit offset into the basic block of the next instruction. The 5‑bit offset is only visible on exceptions.
CSP | 8 | 3 + 58 | | | The Call Stack Pointer holds the ring number and address of the return address stack maintained by call and return basic blocks. The Program Stack Pointer is held in an AR designated by the Software ABI. There is one CSP per ring.
COUNT | 1 | 64 | | | The Loop Count register is used for the innermost loop that iterates up to a number of times determined prior to entering the loop. It is set by the MOVCOUNTA instruction and is decremented to zero by the LOOP instruction.
CARRY | 1 | 64 | | | The Carry register is used as an implicit input and output on multiplication as follows: p ← SR[c] + (SR[a] ×u SR[b]) + CARRY; SR[d] ← p63..0; CARRY ← p127..64. It could also be used in the ADDC instruction as follows: s ← SR[a] +u SR[b] + CARRY0; SR[d] ← s63..0; CARRY ← 063 ∥ s64; but in this case it may be preferable to use BR sources and destinations instead.
VL | 1 | 64 | | | The Vector Length register specifies the length of vector loads, stores, and operations.
VSTART | 1 | 7 | | | The Vector Start register is used to restart vector operations after exceptions. Details to follow.
VM | 4 | 128 | | | The Vector Mask register file stores a bit mask for elements of vector operations. VM[0] is hardwired to all 1s and is used for unmasked operations.
AR | 16 | 133 | 2 | 1 | The Address Register file holds pointers and integers used in calculations related to control flow and to load and store address generation. No AR is hardwired to 0. Bits 63..0 are address or data (bits 63..61 are the ring number if an address), bits 71..64 are the tag, and bits 132..72 are the size expanded from the tag, potentially set by reading address−8. In some micro-architectures, operations on ARs are executed speculatively. (Non-AR operations may be queued until non-speculative, or may be speculatively executed as well.)
XR | 16 | 72 | 2 | 1 | The Index Register file holds integers used in calculations related to control flow and to load and store address generation. No XR is hardwired to 0. Bits 63..0 are data and bits 71..64 are the tag. The XRs primarily hold integer tagged data, but other tags may be loaded. In some micro-architectures, operations on XRs are executed speculatively. (Non-XR operations may be queued until non-speculative, or may be speculatively executed as well.) The XR register file requires two read ports and one write port per instruction.
SR | 16 | 72 | 3 | 1 | The Scalar Register file holds data for computations not involved in address generation, primarily integer or floating-point values. Tags are stored, and so SRs may be used for copying arbitrary data, including pointers, but no instruction uses SRs as an address (e.g. base) register. Integer operations check for integer tags, and floating-point operations check for float tags. No SR is hardwired to 0. In some micro-architectures, operations on SRs occur later in the pipeline than operations on ARs, separated by a queue, allowing these operations to wait for data cache misses while the AR engine continues to move ahead generating addresses. When multiple functional units operate in parallel, only some will support 3 source operands, with the others only two. The instructions with three SR source operands are multiply/add (both integer and floating-point) and funnel shifts.
BR | 16 | 1 | 3 | 1 | Boolean Registers hold boolean values, such as the result of comparisons and logical operations on other boolean values. BRs are typically used to hold SR register comparisons and may avoid branch prediction misses in some algorithms. BR[0] is hardwired to 0. Attempts to write 1 to BR[0] trap, which converts such instructions into negative assertions.
VR | 16 | 72 × 128 | 3 | 1 | Vector Registers hold vectors of tagged data, typically integers or floating-point data.
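The CARRY register semantics from the table can be checked with a short model. This sketch follows the two formulas given for the multiply-accumulate and ADDC cases; the function names and register-value parameters are mine.

```python
MASK64 = (1 << 64) - 1

def mul_accumulate(sr_a, sr_b, sr_c, carry):
    """Return (SR[d], new CARRY) for the widening multiply-accumulate.

    p ← SR[c] + (SR[a] ×u SR[b]) + CARRY; SR[d] ← p63..0; CARRY ← p127..64
    """
    p = sr_c + sr_a * sr_b + carry   # fits in 128 bits even at max operands
    return p & MASK64, p >> 64

def add_with_carry(sr_a, sr_b, carry):
    """ADDC model: s ← SR[a] +u SR[b] + CARRY0; CARRY gets the carry-out."""
    s = sr_a + sr_b + (carry & 1)
    return s & MASK64, s >> 64
```

Note that the worst case, (2^64−1)² + (2^64−1) + (2^64−1) = 2^128 − 1, still fits in 128 bits, so the full accumulated sum never overflows the p127..0 result.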
The SR register file must support 3 read ports and 1 write port per instruction, at least for floating-point multiply/add instructions. Since it does, other operations on SRs may take advantage of the third source operand.
The next field of the BB descriptor is used to specify how the successor to the current BB is determined. The values are given in the following table:
Value | Description |
---|---|
0 | Unconditional branch: The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below. |
1 | Conditional branch: The branch predictor is used to determine whether this branch is taken or not, and this prediction is checked by the branch decision given by a branch instruction in the instructions of the basic block. There should be exactly one branch, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8. |
2 | Call: The address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below. |
3 | Conditional Call: The branch predictor is used to determine whether this call is taken or not, and this prediction is checked by the branch decision given by a branch instruction in the instructions of the basic block. There should be exactly one branch, which may be located anywhere in the basic block instructions. In the case the call is not taken, the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, the address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. |
4 | Loop back: The shadow value of the COUNT register is used to determine whether this branch is taken or not, and this prediction is checked by the LOOP instruction in the instructions of the basic block. There should be exactly one LOOP, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8. |
5 | Conditional Loop back: The branch predictor is used to determine whether this loop back is taken or not, and this prediction is checked by the branch decision given by a branch instruction in the instructions of the basic block. If the loop back is enabled by the branch, the shadow value of the COUNT register is used to determine whether this loop is taken or not, and this prediction is checked by the LOOP instruction in the instructions of the basic block. There should be exactly one LOOP, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below, or is the fall-through BB descriptor at PC + 8. |
6 | Fall through: This Basic Block is unconditionally followed by the BB at PC + 8. |
7 | Reserved. |
8 | Jump Indirect: The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. |
9 | Conditional Jump Indirect: The branch predictor is used to determine whether this jump indirect is taken or not, and this prediction is checked by a branch instruction in the instructions of the basic block. If the jump indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. In the case where the jump is not taken, the destination is the fall-through BB descriptor at PC + 8. This type is expected to be used for case dispatch, where the conditional test checks whether the value is within range, and the JUMP uses PC ← PC + (XR[b] × 8) to choose one of several dispatch basic block descriptors, presuming that the BBs fit in the same 4 KiB region (if not, then a table and PC ← lvload72(AR[a] + XR[b]) should be used). |
10 | Call Indirect: The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. The address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. |
11 | Conditional Call Indirect: The branch predictor is used to determine whether this call indirect is taken or not, and this prediction is checked by a branch instruction in the instructions of the basic block. If the call indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JUMP instruction in the instructions of the basic block. There should be exactly one JUMP, which may be located anywhere in the basic block instructions. In the case where the call is not taken, the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to by CSP[TargetPC63..61], and CSP[TargetPC63..61] is incremented by 8. |
12 | Return: The Call Stack cache is used to predict the return using CSP[PC63..61] − 8 as the index and CSP[PC63..61] is decremented by 8. |
13 | Reserved. |
14 | Reserved. |
15 | Reserved. |
The prev field of the BB descriptor is used to specify what methods are allowed to get to this BB as a set of bits, with prev1..0 controlling interpretation of prev5..2:
prev1..0 = 0:
Bit | Description |
---|---|
2 | Fall through to this BB allowed |
3 | Branch/Loopback to this BB allowed |
4 | Jump to this BB allowed (for case dispatch) |
5 | Return to this BB allowed |
prev1..0 = 1:
Bit | Description |
---|---|
2 | Call allowed |
3 | Gate allowed |
4 | Reserved |
5 | Reserved |
prev1..0 = 2:
Bit | Description |
---|---|
2 | Reserved |
3 | Reserved |
4 | Reserved |
5 | Reserved |
prev1..0 = 3:
Bit | Description |
---|---|
2 | Exception entry |
3 | Reserved |
4 | Reserved |
5 | Reserved |
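The prev-field check can be sketched as follows. This is a hypothetical model: the mapping of prev1..0 values 0-3 to the four tables above, and the bit positions, are assumptions taken from the order in which the tables appear.

```python
# Hypothetical sketch of the prev-field CFI check. Bit positions and the
# prev1..0 table selection are assumptions from the tables above.

FALL_THROUGH, BRANCH, JUMP, RETURN = 2, 3, 4, 5   # bits when prev1..0 = 0
CALL, GATE = 2, 3                                  # bits when prev1..0 = 1
EXCEPTION = 2                                      # bit when prev1..0 = 3

def cfi_allowed(prev: int, arrival: str) -> bool:
    """True if arriving at this BB via `arrival` is permitted."""
    mode = prev & 0b11
    bit = {
        0: {"fallthrough": FALL_THROUGH, "branch": BRANCH,
            "jump": JUMP, "return": RETURN},
        1: {"call": CALL, "gate": GATE},
        3: {"exception": EXCEPTION},
    }.get(mode, {}).get(arrival)
    return bit is not None and (prev >> bit) & 1 == 1
```

Under these assumptions, a BB reachable only by call would carry prev = 0b000101 (mode 1, call bit set); any fall-through or return into it would then fault.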
Basic Block descriptors with one of the four call types (Call,
Conditional Call, Call Indirect, Conditional Call Indirect), push the
return address on a protected stack addressed by
the CSP indexed by the target ring
number (which is the same as the current ring number unless a gate is
addressed). Returns pop the address from the protected stack and jump
to it. The ring number of the CSP
pointer is used for the stores and loads, and typically this ring is not
writeable by the current ring.
The call semantics are as follows:
lvstore72(CSP[TargetPC63..61]) ← PC + 8
CSP[TargetPC63..61] ← CSP[TargetPC63..61] +p 8
The return semantics are as follows:
PC ← lvload72(CSP[PC63..61] −p 8)
CSP[PC63..61] ← CSP[PC63..61] −p 8
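A minimal sketch of these call/return shadow-stack semantics, assuming eight rings indexed by address bits 63..61 and modeling the protected stack memory as a dictionary; the CSP base values and class names are hypothetical.

```python
# Hypothetical model: one protected call stack per ring, with the ring
# chosen by bits 63..61 of the target (call) or current (return) PC.

NUM_RINGS = 8

class ShadowStacks:
    def __init__(self):
        self.mem = {}                                         # protected memory
        self.csp = [ring << 61 for ring in range(NUM_RINGS)]  # assumed stack bases

    def call(self, pc: int, target_pc: int) -> None:
        ring = (target_pc >> 61) & 0x7
        self.mem[self.csp[ring]] = pc + 8                     # push return address
        self.csp[ring] += 8                                   # CSP[ring] +p 8

    def ret(self, pc: int) -> int:
        ring = (pc >> 61) & 0x7
        self.csp[ring] -= 8                                   # CSP[ring] -p 8
        return self.mem[self.csp[ring]]                       # pop return address
```

Note that a gate call to a different ring pushes onto the target ring's stack, so the matching return (executing in that ring) pops from the same stack.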
Overflow detection is important for implementing bignums in languages such as Lisp. SecureRISC0 provides a reasonably complete set of such instructions in addition to the usual mod 264 add, subtract, negate, multiply, and shift left.
Unsigned overflow could be detected by using the ADDC and SUBC instructions with BR[0] as the carry-in and BR[0] as the carry-out. But it might also make sense to have ADDUO (Add Unsigned with Overflow).
In addition the ADDSO (Add Signed with Overflow), ADDUSO (Add Unsigned Signed with Overflow), SUBSO (Subtract Signed with Overflow), SUBSUO (Subtract Signed Unsigned with Overflow), SUBUSO (Subtract Unsigned Signed with Overflow), and NEGO (Negate with Overflow) instructions provide overflow checking for signed addition, subtraction, and negation, and for mixed signed/unsigned addition and subtraction. There is also SLLO (Shift Left Logical with Overflow) and SLAO (Shift Left Arithmetic with Overflow) in addition to the usual SLL. Finally there are MULUO, MULSO, and MULSUO for multiplication with overflow detection.
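The overflow conditions these instructions detect can be sketched for ADDSO and ADDUO; values are 64-bit two's-complement words in [0, 2^64), the trap is modeled as a Python exception, and the function names simply echo the mnemonics.

```python
# Sketch of overflow detection matching ADDSO/ADDUO semantics.

MASK64 = (1 << 64) - 1

def addso(a: int, b: int) -> int:
    # signed add: overflow iff the operands share a sign bit
    # that the result does not
    s = (a + b) & MASK64
    if ((~(a ^ b)) & (a ^ s)) >> 63 & 1:
        raise OverflowError("ADDSO overflow")
    return s

def adduo(a: int, b: int) -> int:
    # unsigned add: overflow iff there is a carry out of bit 63
    s = a + b
    if s > MASK64:
        raise OverflowError("ADDUO overflow")
    return s & MASK64
```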
Overflow in the unsigned addition of load/store effective address generation is trapped. Segment bounds are also checked during effective address generation: the segment size is determined from the base register, and the effective address must agree with the base register for bits 60..size (this is hard to implement and might need another cache).
SecureRISC0 has trap instructions and Boolean Registers (BRs) primarily as a way to avoid conditional branching for computation. For example, to compute the min of x1 and x3 into x6, the RISC‑V ISA would use conditional branches:
    move x6, x1
    blt  x1, x3, L
    move x6, x3
L:
The performance of the above on contemporary micro-architectures depends on the conditional branch prediction rate and the mispredict penalty, which in turn depends on how consistently x1 or x3 is the minimum value. In SecureRISC0, the sequence could be as follows:
    lt  b2, s1, s3
    sel s6, b2, s1, s3
This sequence involves no conditional branches, and has consistent performance.
As another example, the range test
assert ((lo <= x) && (x <= hi));
on RISC‑V would compile to
    blt x, lo, T
    bge hi, x, L
T:  jal assertionfailed
L:
but on SecureRISC0 would compile to
    lt   b1, x, lo
    orle b0, b1, hi, x
which involves no conditional branches, instead using a write to b0 as a negative assertion check (trap if the value to be written is 1). The assembler would also accept torle b1, hi, x as equivalent to the above orle by supplying the b0 destination operand.
Even when conditional branches are used, the boolean registers sometimes permit several tests to be combined before branching, so if we were branching on the range test above, instead of asserting it, the code might be
    lt    b1, x, lo
    borle b1, hi, x, outofrange
which has one branch rather than two.
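The b0 negative-assertion idiom can be modeled as follows. The register file and trap machinery are assumptions, and the high-bound test is written directly as x > hi rather than committing to a specific comparison mnemonic.

```python
# Hypothetical model of b0 as a negative-assertion sink: b0 reads as 0,
# and any instruction that writes 1 to it traps (modeled as an exception).

class BooleanRegs:
    def __init__(self):
        self.br = [0] * 16                 # b0..b15

    def write(self, d: int, value: int) -> None:
        if d == 0:
            if value:
                raise RuntimeError("negative assertion failed (b0 <- 1)")
            return                         # b0 stays hardwired to 0
        self.br[d] = value

def range_check(regs: BooleanRegs, x: int, lo: int, hi: int) -> None:
    # models: lt b1, x, lo ; then OR the high-bound failure into b0
    regs.write(1, int(x < lo))
    regs.write(0, regs.br[1] | int(x > hi))
```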
Operations on tagged values trap if the tags are unexpected values. Integer addition requires that both tags be integers, or that one tag be a pointer type and the other an integer. Integer subtraction requires the subtrahend tag to be an integer tag and the minuend to be either an integer or pointer tag. The resulting tag is integer when all sources are integers, or pointer when one operand is a pointer. Integer bitwise logical operations and shifts require integer tagged operands and produce an integer tagged result. Floating point addition, subtraction, multiplication, division, and square root require floating-point tagged operands. Performing integer operations on floating-point tagged values (e.g. to extract the exponent) requires a CAST instruction to first change the tag. Similarly, performing logical operations on a pointer requires a CAST instruction to integer type.
Comparisons of tagged values compare the entire word for =, ≠, <u, ≥u etc. This allows sorting regardless of type. Similarly the CMPU operation produces −1, 0, 1 based on <u, =, >u of word values.
One advantage of the 3 read SR file is
that shifts can be based upon a
funnel shift where the value to be shifted is the catenation
of SR[a]
and SR[b],
allowing for rotates by specifying the same operand for the high and low
funnel operands, and multiword shifts by supplying adjacent source words
of the multiword value. The basic operations are then
SR[d] ← (SR[b] ∥ SR[a]) >> imm6,
SR[d] ← (SR[b] ∥ SR[a]) >> (SR[c] mod 64), and
SR[d] ← (SR[b] ∥ SR[a]) >> (−SR[c] mod 64).
Conventional logical and arithmetic shifts are also provided. Left shifts supply 0 for the low side of the funnel and use a negative shift amount. Logical right shifts supply 0 on the high side of the funnel, and arithmetic right shifts supply a sign-extended version of SR[a] on the high side of the funnel.
Need to decide whether overflow detecting left shifts are required.
The CARRY register could be used as a funnel shift operand instead of an SR, but that seems less flexible.
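The funnel-shift primitive and the shifts derived from it can be sketched as follows; the function names are illustrative, not ISA mnemonics.

```python
# Sketch of the funnel-shift primitive: shift the 128-bit catenation
# (hi || lo) right and keep the low 64 bits.

MASK64 = (1 << 64) - 1

def funnel(hi: int, lo: int, amount: int) -> int:
    # low 64 bits of (hi || lo) shifted right by amount mod 64
    return (((hi << 64) | lo) >> (amount & 63)) & MASK64

def rotr(x: int, n: int) -> int:
    # rotate right: the same word supplies both funnel inputs
    return funnel(x, x, n)

def sll(x: int, n: int) -> int:
    # left shift: 0 on the low side, negative shift amount;
    # n = 0 must be special-cased, since (-0 mod 64) selects the
    # (zero) low word -- one reason conventional shifts are also provided
    return funnel(x, 0, -n) if n else x
```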
The Add with Carry instruction ADDC is defined to take a BR source as the carry-in and a BR destination as the carry-out, for multiword addition. The definition is then
s ← SR[a] +u SR[b] +u BR[c]
SR[d] ← s63..0
BR[e] ← s64.
An alternative requires fewer operands, but uses one bit in the
64‑bit CARRY register:
s ← SR[a] +u SR[b] +u CARRY0
SR[d] ← s63..0
CARRY ← 063 ∥ s64.
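Multiword addition with ADDC chaining can be sketched as follows:

```python
# Sketch of ADDC (BR carry form) chained across a little-endian
# multiword integer.

MASK64 = (1 << 64) - 1

def addc(a: int, b: int, carry_in: int) -> tuple[int, int]:
    s = a + b + carry_in
    return s & MASK64, s >> 64          # (sum word s63..0, carry-out s64)

def add_multiword(x: list[int], y: list[int]) -> list[int]:
    # x and y are little-endian lists of 64-bit words of equal length
    carry, out = 0, []
    for a, b in zip(x, y):
        word, carry = addc(a, b, carry)
        out.append(word)
    return out
```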
The ideal multiplication operation would be
SR[e],SR[d] ← (SR[a] ×u SR[b]) + SR[c] + SR[f]
to efficiently support multiword multiplication, but that requires 4
reads and 2 writes, which we clearly don’t want. The tentative
alternative is to introduce a
64‑bit CARRY register to provide
the additional 64‑bit input to the 128‑bit product and
a place to store the high 64 bits of the product. This
requires some careful thought for OoO micro-architectures and so
is a tentative proposal. It may be that even an OoO processor
will be called on to have a subset of instructions that are to be
executed in-order relative to each other, and the multiword
arithmetic instructions can be put in this queue.
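The CARRY-based multiply step can be sketched as one column of a schoolbook multiword multiply; the state class and function names here are hypothetical.

```python
# Hypothetical sketch of the CARRY-register multiply step: each call
# produces the low 64 bits of a*b + addend + CARRY and leaves the high
# 64 bits in CARRY. The sum always fits in 128 bits, since
# (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.

MASK64 = (1 << 64) - 1

class MulState:
    def __init__(self):
        self.carry = 0

def mulstep(state: MulState, a: int, b: int, addend: int) -> int:
    p = a * b + addend + state.carry
    state.carry = p >> 64              # high 64 bits back to CARRY
    return p & MASK64                  # low 64 bits of the product

def mul_multiword(x: list[int], y: int) -> list[int]:
    # multiply a little-endian multiword x by a single 64-bit word y
    st = MulState()
    out = [mulstep(st, xi, y, 0) for xi in x]
    out.append(st.carry)
    return out
```

The serial dependence through CARRY is exactly what makes this awkward for OoO renaming, as the text notes.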
It may be appropriate to add some instructions that exist only for code size reduction, which expand into multiple SecureRISC0 instructions early in the pipeline (e.g. before register renaming). The best candidates for this so far are doubleword load/store instructions, which would expand into two singleword load/store instructions. This expansion and execution as separate instructions in the backend of the pipeline avoids the issues with register renaming that would otherwise exist. Partial execution of the pair would be allowed (but loads to the source registers would not be allowed). Doubleword load/store instructions significantly reduce the size of function call entry and exit, and may be useful for loading a code pointer and context pointer pair for indirect calls.
The following outlines some of the instructions without giving them their full definitions, which includes tag and bounds checking. The full definitions will follow later.
The 16‑bit instruction formats are included for code density. Some evaluation of whether it is worth the cost should be considered. Note that the BB descriptor gives the sizes of all instructions in the basic block in the form of the start bit mask, and so the instruction size is not encoded in the opcodes. The start mask allows multiple instructions to be decoded in parallel without parsing the instruction stream.
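Decoding instruction boundaries from a start bit mask can be sketched as follows, assuming a 16-bit parcel granularity (consistent with the 16-bit formats), with bit i of the mask set when an instruction starts at parcel i:

```python
# Hypothetical sketch: recover instruction sizes (in bits) from the BB
# descriptor's start-bit mask without parsing any opcodes.

def instruction_sizes(start_mask: int, nparcels: int) -> list[int]:
    # consecutive start positions give each instruction's size
    starts = [i for i in range(nparcels) if (start_mask >> i) & 1]
    starts.append(nparcels)            # sentinel closing the last instruction
    return [(b - a) * 16 for a, b in zip(starts, starts[1:])]
```

With mask 0b1011 over four parcels this yields a 16-bit, a 32-bit, and a 16-bit instruction, each of which can be routed to a decoder in parallel.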
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
b | a | d | op1 | ||||||||
4 | 4 | 4 | 4 |
ADDA | d, a, b | AR[d] ← AR[a] +p XR[b] |
ADDX | d, a, b | XR[d] ← XR[a] + XR[b] |
ADD | d, a, b | SR[d] ← SR[a] + SR[b] |
LAD | d, a, b | AR[d] ← lvload144(AR[a] +p AR[b]×16) |
LA | d, a, b | AR[d] ← lvload72(AR[a] +p XR[b]×8) |
LX | d, a, b | XR[d] ← lvload72(AR[a] +p XR[b]×8) |
L | d, a, b | SR[d] ← lvload72(AR[a] +p XR[b]×8) |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
imm4 | a | d | op1 | ||||||||
4 | 4 | 4 | 4 |
ADDAI | d, a, imm4 | AR[d] ← AR[a] +p imm4 |
ADDXI | d, a, imm4 | XR[d] ← XR[a] + imm4 |
ADDI | d, a, imm4 | SR[d] ← SR[a] + imm4 |
LAI | d, a, imm4 | AR[d] ← lvload72(AR[a] +p imm4×8) |
LXI | d, a, imm4 | XR[d] ← lvload72(AR[a] +p imm4×8) |
LI | d, a, imm4 | SR[d] ← lvload72(AR[a] +p imm4×8) |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
op1da | a | d | op1 | ||||||||
4 | 4 | 4 | 4 |
RTAG | d, a | AR[d] ← 240 ∥ 056 ∥ AR[a]71..64 |
RSIZE | d, a | AR[d] ← 240 ∥ 03 ∥ AR[a]132..72 |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
b | a | op1ab | op1 | ||||||||
4 | 4 | 4 | 4 |
BEQA | a, b | branch if AR[a] = AR[b] |
BEQX | a, b | branch if XR[a] = XR[b] |
BNEA | a, b | branch if AR[a] ≠ AR[b] |
BNEX | a, b | branch if XR[a] ≠ XR[b] |
BLTAU | a, b | branch if AR[a] <u AR[b] |
BLTXU | a, b | branch if XR[a] <u XR[b] |
BGEAU | a, b | branch if AR[a] ≥u AR[b] |
BGEXU | a, b | branch if XR[a] ≥u XR[b] |
BLTX | a, b | branch if XR[a] <s XR[b] |
BGEX | a, b | branch if XR[a] ≥s XR[b] |
BNONEX | a, b | branch if (XR[a] & XR[b]) = 0 |
BANYX | a, b | branch if (XR[a] & XR[b]) ≠ 0 |
TEQA | a, b | trap if AR[a] = AR[b] |
TEQX | a, b | trap if XR[a] = XR[b] |
TNEA | a, b | trap if AR[a] ≠ AR[b] |
TNEX | a, b | trap if XR[a] ≠ XR[b] |
TLTAU | a, b | trap if AR[a] <u AR[b] |
TLTXU | a, b | trap if XR[a] <u XR[b] |
TGEAU | a, b | trap if AR[a] ≥u AR[b] |
TGEXU | a, b | trap if XR[a] ≥u XR[b] |
TLTX | a, b | trap if XR[a] <s XR[b] |
TGEX | a, b | trap if XR[a] ≥s XR[b] |
TNONEX | a, b | trap if (XR[a] & XR[b]) = 0 |
TANYX | a, b | trap if (XR[a] & XR[b]) ≠ 0 |
15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | ||||
op1a | a | op1ab | op1 | ||||||||
4 | 4 | 4 | 4 |
BEQNA | a | branch if AR[a]71..64 = 0 |
BNENA | a | branch if AR[a]71..64 ≠ 0 |
BEQZX | a | branch if XR[a] = 0 |
BNEZX | a | branch if XR[a] ≠ 0 |
BLTZX | a | branch if XR[a] <s 0 |
BGEZX | a | branch if XR[a] ≥s 0 |
BLEZX | a | branch if XR[a] ≤s 0 |
BGTZX | a | branch if XR[a] >s 0 |
BF | a | branch if BR[a] = 0 |
BT | a | branch if BR[a] ≠ 0 |
TEQZX | a | trap if XR[a] = 0 |
TNEZX | a | trap if XR[a] ≠ 0 |
TLTZX | a | trap if XR[a] <s 0 |
TGEZX | a | trap if XR[a] ≥s 0 |
TLEZX | a | trap if XR[a] ≤s 0 |
TGTZX | a | trap if XR[a] >s 0 |
TF | a | trap if BR[a] = 0 |
TT | a | trap if BR[a] ≠ 0 |
JMP | a | PC ← AR[a] |
LOOP | | COUNT ← COUNT − 1; branch if COUNT ≠ 0 |
15 | 8 | 7 | 4 | 3 | 0 | |||
imm8 | d | op1 | ||||||
8 | 4 | 4 |
XI | d, imm8 | XR[d] ← 240 ∥ imm8748 ∥ imm8 |
I | d, imm8 | SR[d] ← 240 ∥ imm8748 ∥ imm8 |
31 | 28 | 27 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | m | c | b | a | d | op2 | |||||||||||||||
4 | 6 | 2 | 4 | 4 | 4 | 4 | 4 |
ao1ao2 | d, c, a, b | SR[d] ← SR[c] ao1 (SR[a] ao2 SR[b]) |
Example ao1 might be: + (ADD) − (SUB) | ||
Example ao2 might be: + (ADD) − (SUB) × (MUL) | ||
FUN | d, b, a, c | t ← (SR[b]63..0∥SR[a]63..0) >> SR[c]5..0 SR[d] ← 240 ∥ t63..0 |
FUNN | d, b, a, c | t ← (SR[b]63..0∥SR[a]63..0) >> (−SR[c])5..0 SR[d] ← 240 ∥ t63..0 |
fo1fo2.D | d, c, a, b | SR[d] ← SR[c] fo1 (SR[a] fo2 SR[b]) |
Vao1ao2 | d, c, a, b, m | VR[d] ← VR[c] ao1 (VR[a] ao2 VR[b]) masked by VM[m] |
Vao1ao2 | d, c, a, b, m | VR[d] ← VR[c] ao1 (VR[a] ao2 SR[b]) masked by VM[m] |
Vfo1fo2.D | d, c, a, b, m | VR[d] ← VR[c] fo1 (VR[a] fo2 VR[b]) masked by VM[m] |
Vfo1fo2.D | d, c, a, b, m | VR[d] ← VR[c] fo1 (VR[a] fo2 SR[b]) masked by VM[m] |
Example fo1 might be: +f (ADD) −f (SUB) | ||
Example fo2 might be: +f (ADD) −f (SUB) ×f (MUL) | ||
bo1bo2 | d, c, a, b | SR[d] ← SR[c] bo1 (SR[a] bo2 SR[b]) |
Example bo1 might be: & (AND) | (OR) ^ (XOR) | ||
Example bo2 might be: & (AND) | (OR) &~ (ANDC) |~ (ORC) ^ (XOR) ^~ (XORC) << (SLL) >>u (SRL) >>s (SRA) | ||
SEL | d, c, a, b | SR[d] ← BR[c] ? SR[a] : SR[b] |
lo1lo2 | d, c, a, b | BR[d] ← BR[c] lo1 (BR[a] lo2 BR[b]) |
lo1copA | d, c, a, b | BR[d] ← BR[c] lo1 (AR[a] cop AR[b]) |
lo1copX | d, c, a, b | BR[d] ← BR[c] lo1 (XR[a] cop XR[b]) |
lo1cop | d, c, a, b | BR[d] ← BR[c] lo1 (SR[a] cop SR[b]) |
Vlo1cop | d, c, a, b | VM[d] ← VM[c] lo1 (VR[a] cop VR[b]) |
Vlo1cop | d, c, a, b | VM[d] ← VM[c] lo1 (VR[a] cop SR[b]) |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | i | c | i | a | d | op2 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
ao1ao2I | d, c, a, imm | SR[d] ← SR[c] ao1 (SR[a] ao2 imm12) |
bo1bo2I | d, c, a, imm | SR[d] ← SR[c] bo1 (SR[a] bo2 imm12) |
SELI | d, c, a, imm12 | SR[d] ← BR[c] ? SR[a] : imm12 |
lo1copI | d, c, a, imm | BR[d] ← BR[c] lo1 (AR[a] cop imm12) |
lo1copI | d, c, a, imm | BR[d] ← BR[c] lo1 (SR[a] cop imm12) |
Vlo1copI | d, c, a, imm | VM[d] ← VM[c] lo1 (VR[a] cop imm12) |
BR[0] is hardwired to 0. Using BR[0] as a destination acts as negative assertion, taking an exception if the value computed is 1. | ||
Example lo1/lo2: & (AND) | (OR) ^ (XOR) &~ (ANDC) |~ (ORC) ^~ (XORC) | ||
Example cop: = (EQ) ≠ (NE) <u (LTU) <s (LT) ≥u (GEU) ≥s (GE) tag= tag≠ tag< tag≥ word= word≠ word< word≥ |
31 | 28 | 27 | 22 | 21 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | op23 | b | a | d | op2 | ||||||||||||||
4 | 6 | 6 | 4 | 4 | 4 | 4 |
op0X | d, a, b | XR[d] ← XR[a] op0 XR[b] |
Example op0 might be: + (ADD) − (SUB) << (SLL) >>u (SRL) >>s (SRA) | ||
Possible op0 might include: minu mins, maxu maxs | ||
ao2 | d, a, b | SR[d] ← SR[a] ao2 SR[b] |
LX32U | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) XR[d] ← 240 ∥ 032 ∥ t |
L32U | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) SR[d] ← 240 ∥ 032 ∥ t |
LX32S | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) XR[d] ← 240 ∥ t3132 ∥ t |
L32S | d, a, b | t ← lvload32(AR[a] +p XR[b]×4) SR[d] ← 240 ∥ t3132 ∥ t |
LX16U | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) XR[d] ← 240 ∥ 048 ∥ t |
L16U | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) SR[d] ← 240 ∥ 048 ∥ t |
LX16S | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) XR[d] ← 240 ∥ t1548 ∥ t |
L16S | d, a, b | t ← lvload16(AR[a] +p XR[b]×2) SR[d] ← 240 ∥ t1548 ∥ t |
LX8U | d, a, b | t ← lvload8(AR[a] +p XR[b]) XR[d] ← 240 ∥ 056 ∥ t |
L8U | d, a, b | t ← lvload8(AR[a] +p XR[b]) SR[d] ← 240 ∥ 056 ∥ t |
LX8S | d, a, b | t ← lvload8(AR[a] +p XR[b]) XR[d] ← 240 ∥ t756 ∥ t |
L8S | d, a, b | t ← lvload8(AR[a] +p XR[b]) SR[d] ← 240 ∥ t756 ∥ t |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | i | op24 | i | a | d | op2 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
op0AI | d, a, imm | AR[d] ← AR[a] op0 imm12 |
op0XI | d, a, imm | XR[d] ← XR[a] op0 imm12 |
ao2I | d, a, imm | SR[d] ← SR[a] ao2 imm12 |
LADI | d, a, imm | AR[d] ← lvload144(AR[a] +p imm12×16) |
LAI | d, a, imm | AR[d] ← lvload72(AR[a] +p imm12×8) |
LXI | d, a, imm | XR[d] ← lvload72(AR[a] +p imm12×8) |
LI | d, a, imm | SR[d] ← lvload72(AR[a] +p imm12×8) |
LX32UI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) XR[d] ← 240 ∥ 032 ∥ t |
L32UI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) SR[d] ← 240 ∥ 032 ∥ t |
LX32SI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) XR[d] ← 240 ∥ t3132 ∥ t |
L32SI | d, a, imm | t ← lvload32(AR[a] +p imm12×4) SR[d] ← 240 ∥ t3132 ∥ t |
LX16UI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) XR[d] ← 240 ∥ 048 ∥ t |
L16UI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) SR[d] ← 240 ∥ 048 ∥ t |
LX16SI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) XR[d] ← 240 ∥ t1548 ∥ t |
L16SI | d, a, imm | t ← lvload16(AR[a] +p imm12×2) SR[d] ← 240 ∥ t1548 ∥ t |
LX8UI | d, a, imm | t ← lvload8(AR[a] +p imm12) XR[d] ← 240 ∥ 056 ∥ t |
L8UI | d, a, imm | t ← lvload8(AR[a] +p imm12) SR[d] ← 240 ∥ 056 ∥ t |
LX8SI | d, a, imm | t ← lvload8(AR[a] +p imm12) XR[d] ← 240 ∥ t756 ∥ t |
L8SI | d, a, imm | t ← lvload8(AR[a] +p imm12) SR[d] ← 240 ∥ t756 ∥ t |
MOVACOUNT | d | AR[d] ← 240 ∥ COUNT |
MOVCOUNTA | a | COUNT ← AR[a]63..0 |
MOVAS | d, a | AR[d] ← SR[a] |
MOVSA | d, a | SR[d] ← AR[a] |
MOVAB | d, a | AR[d] ← 240 ∥ 063 ∥ BR[a] |
MOVBA | d, a, imm6 | BR[d] ← AR[a]imm6 |
MOVSB | d, a | SR[d] ← 240 ∥ 063 ∥ BR[a] |
MOVBS | d, a, imm6 | BR[d] ← SR[a]imm6 |
MOVSBALL | d | SR[d] ← 240 ∥ 048 ∥ BR[15]∥BR[14]∥…∥BR[1]∥0 |
MOVSVM | d, m, w | SR[d] ← 240 ∥ VM[m]w×64+63..w×64 |
MOVVMS | d, a, w | VM[d]w×64+63..w×64 ← SR[a] |
31 | 28 | 27 | 22 | 21 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | imm6 | b | a | d | op2 | ||||||||||||||
4 | 6 | 6 | 4 | 4 | 4 | 4 |
FUNI | d, b, a, imm6 | t ← (SR[b]63..0∥SR[a]63..0) >> imm6 SR[d] ← 240 ∥ t63..0 |
31 | 28 | 27 | 22 | 21 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | op21 | m | c | b | a | op22 | op2 | |||||||||||||||
4 | 6 | 2 | 4 | 4 | 4 | 4 | 4 |
SA | c, a, b | lvstore72(AR[a] +p XR[b]×8) ← AR[c] |
SX | c, a, b | lvstore72(AR[a] +p XR[b]×8) ← XR[c] |
S | c, a, b | lvstore72(AR[a] +p XR[b]×8) ← SR[c] |
SX32 | c, a, b | lvstore32(AR[a] +p XR[b]×4) ← XR[c]31..0 |
S32 | c, a, b | lvstore32(AR[a] +p XR[b]×4) ← SR[c]31..0 |
SX16 | c, a, b | lvstore16(AR[a] +p XR[b]×2) ← XR[c]15..0 |
S16 | c, a, b | lvstore16(AR[a] +p XR[b]×2) ← SR[c]15..0 |
SX8 | c, a, b | lvstore8(AR[a] +p XR[b]) ← XR[c]7..0 |
S8 | c, a, b | lvstore8(AR[a] +p XR[b]) ← SR[c]7..0 |
Blo2 | a, b | branch if BR[a] lo2 BR[b] |
(equivalent to BORlo2 b0, a, b) | ||
BEQA | a, b | branch if AR[a] = AR[b] |
(equivalent to BOREQA b0, a, b) | ||
BEQX | a, b | branch if XR[a] = XR[b] |
BNEA | a, b | branch if AR[a] ≠ AR[b] |
BNEX | a, b | branch if XR[a] ≠ XR[b] |
BLTAU | a, b | branch if AR[a] <u AR[b] |
BLTXU | a, b | branch if XR[a] <u XR[b] |
BGEAU | a, b | branch if AR[a] ≥u AR[b] |
BGEXU | a, b | branch if XR[a] ≥u XR[b] |
BLTX | a, b | branch if XR[a] <s XR[b] |
BGEX | a, b | branch if XR[a] ≥s XR[b] |
BNONEX | a, b | branch if (XR[a] & XR[b]) = 0 |
BANYX | a, b | branch if (XR[a] & XR[b]) ≠ 0 |
Blo1lo2 | c, a, b | branch if BR[c] lo1 (BR[a] lo2 BR[b]) |
Blo1EQA | c, a, b | branch if BR[c] lo1 (AR[a] = AR[b]) |
Blo1EQX | c, a, b | branch if BR[c] lo1 (XR[a] = XR[b]) |
Blo1NEA | c, a, b | branch if BR[c] lo1 (AR[a] ≠ AR[b]) |
Blo1NEX | c, a, b | branch if BR[c] lo1 (XR[a] ≠ XR[b]) |
Blo1LTAU | c, a, b | branch if BR[c] lo1 (AR[a] <u AR[b]) |
Blo1LTXU | c, a, b | branch if BR[c] lo1 (XR[a] <u XR[b]) |
Blo1GEAU | c, a, b | branch if BR[c] lo1 (AR[a] ≥u AR[b]) |
Blo1GEXU | c, a, b | branch if BR[c] lo1 (XR[a] ≥u XR[b]) |
Blo1LTX | c, a, b | branch if BR[c] lo1 (XR[a] <s XR[b]) |
Blo1GEX | c, a, b | branch if BR[c] lo1 (XR[a] ≥s XR[b]) |
Blo1NONEX | c, a, b | branch if BR[c] lo1 ((XR[a] & XR[b]) = 0) |
Blo1ANYX | c, a, b | branch if BR[c] lo1 ((XR[a] & XR[b]) ≠ 0) |
31 | 28 | 27 | 20 | 19 | 16 | 15 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||||
op20 | i | c | i | a | op22 | op2 | ||||||||||||||
4 | 8 | 4 | 4 | 4 | 4 | 4 |
SAI | c, a, imm | lvstore72(AR[a] +p imm12×8) ← AR[c] |
SI | c, a, imm | lvstore72(AR[a] +p imm12×8) ← SR[c] |
SA32I | c, a, imm | lvstore32(AR[a] +p imm12×4) ← AR[c]31..0 |
S32I | c, a, imm | lvstore32(AR[a] +p imm12×4) ← SR[c]31..0 |
SA16I | c, a, imm | lvstore16(AR[a] +p imm12×2) ← AR[c]15..0 |
S16I | c, a, imm | lvstore16(AR[a] +p imm12×2) ← SR[c]15..0 |
SA8I | c, a, imm | lvstore8(AR[a] +p imm12) ← AR[c]7..0 |
S8I | c, a, imm | lvstore8(AR[a] +p imm12) ← SR[c]7..0 |
BEQXI | a, imm12 | branch if XR[a] = imm12 |
(equivalent to BOREQXI b0, a, imm12) | ||
BNEXI | a, imm12 | branch if XR[a] ≠ imm12 |
BLTXUI | a, imm12 | branch if XR[a] <u imm12 |
BGEXUI | a, imm12 | branch if XR[a] ≥u imm12 |
BLTXI | a, imm12 | branch if XR[a] <s imm12 |
BGEXI | a, imm12 | branch if XR[a] ≥s imm12 |
BNONEXI | a, imm12 | branch if (XR[a] & imm12) = 0 |
BANYXI | a, imm12 | branch if (XR[a] & imm12) ≠ 0 |
Blo1EQXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] = imm12) |
Blo1NEXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] ≠ imm12) |
Blo1LTXUI | c, a, imm12 | branch if BR[c] lo1 (XR[a] <u imm12) |
Blo1GEXUI | c, a, imm12 | branch if BR[c] lo1 (XR[a] ≥u imm12) |
Blo1LTXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] <s imm12) |
Blo1GEXI | c, a, imm12 | branch if BR[c] lo1 (XR[a] ≥s imm12) |
Blo1NONEXI | c, a, imm12 | branch if BR[c] lo1 ((XR[a] & imm12) = 0) |
Blo1ANYXI | c, a, imm12 | branch if BR[c] lo1 ((XR[a] & imm12) ≠ 0) |
SWITCHR | b | PC ← PC +p (XR[b]×8) |
Used when all cases are in the current page. | |
SWITCHI | a, imm12 | PC ← AR[a] +p (imm12×8) |
SWITCH | a, b | PC ← AR[a] +p (XR[b]×8) |
LJMPI | a, imm12 | PC ← lvload72(AR[a] +p imm12×8) |
LJMP | a, b | PC ← lvload72(AR[a] +p XR[b]×8) |
31 | 28 | 27 | 12 | 11 | 8 | 7 | 4 | 3 | 0 | |||||
op20 | imm16 | a | d | op2 | ||||||||||
4 | 16 | 4 | 4 | 4 |
ALLOCI | d, a, imm7 | AR[d] ← 051∥imm7∥03 ∥ imm7 ∥ min(AR[a]63..61, PC63..61) ∥ (AR[a]60..0 + AR[a]132..72) |
ALLOCI | d, a, imm16 |
lvstore72(AR[a] + 8) ← 254∥048∥imm16 AR[d] ← −(048∥imm16) ∥ 254 ∥ min(AR[a]63..61, PC63..61) ∥ (AR[a]60..0+AR[a]132..72+16) |
Primarily used for allocating stack frames with a15: | ||
ALLOCI | sp, sp, imm |
31 | 8 | 7 | 4 | 3 | 0 | |||
imm24 | d | op2 | ||||||
24 | 4 | 4 |
XI | d, imm | XR[d] ← 240 ∥ imm242340∥imm24 |
I | d, imm | SR[d] ← 240 ∥ imm242340∥imm24 |
I expect SecureRISC0 software to use the ILP64 model, where integers and pointers are both 64 bits. Even in the 1980s when MIPS was defining its 64‑bit ISA, I argued that integers should be 64 bits, but keeping integers 32 bits for C was considered sacred by others. The result is that an integer cannot index a large array, which is terrible. With ILP64, I don’t expect SecureRISC0 to need special 32‑bit add instructions (that sign-extend from bit 31 to bits 63..32).
Translation of local virtual addresses to system interconnect addresses is typically performed in a single processor cycle in one of several L1 TLBs, which may be supplemented with one or more L2 TLBs. If the TLBs fail to translate the address, then the processor performs a more lengthy procedure, and if that succeeds, the result is written into the TLBs to speed later translations. This TLB miss procedure determines the memory architecture. As described above, SecureRISC0 uses both segmentation and paging in its memory architecture. The first step of a TLB miss is therefore to determine a segment descriptor and then proceed as that directs. One way of thinking about SecureRISC0 segmentation is that it is a specialized first-level page table that controls the subsequent levels, including giving the page size and table depth (derived from the segment size).
SecureRISC0 segments may be directly mapped to an aligned system virtual address range equal to the segment size, or they may be paged. Direct mapping may be appropriate for I/O regions, for example. It consists of simply changing the high bits (above the segment size) of the local virtual address to the appropriate system virtual address bits, and leaving the low bits (below the segment size) unchanged.
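Direct mapping can be sketched in a few lines, assuming the segment size is a power of two and the system virtual base is aligned to it; segsize_log2 is the log2 of the segment size in bytes.

```python
# Minimal sketch of direct-mapped segment translation: replace the bits
# above the segment size with the system virtual base, keep the low bits.

def direct_map(lvaddr: int, sv_base: int, segsize_log2: int) -> int:
    low_mask = (1 << segsize_log2) - 1
    return (sv_base & ~low_mask) | (lvaddr & low_mask)
```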
Paging in SecureRISC0 takes advantage of segment sizes to be more efficient than in some ISAs and also supports multiple page sizes. The proposal for SecureRISC0 here is that each segment has a page size that is used for all subsequent levels, but this could be generalized to allow a programmable size at each level at the cost of complexity in hardware. My current thoughts on those page sizes are 4 KiB, 16 KiB, and 1 MiB, but which sizes would eventually be supported is a matter for evaluation. Only the last level page size affects the TLB in many micro-architectures, so page size at earlier levels is primarily a question of complexity in the hardware table walk. Supporting multiple page sizes in TLBs is costly, and should be done in a limited way, and I am concerned even about proposing three sizes.
Aside: 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size back in the 1960s as the trade-off between the memory wasted by allocating in page units and the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in 2020s from the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty is then a factor in the page size consideration that did not exist in the 1960s.
Aside: RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads. Multiple sizes do complicate TLB matching.
TLBs introduce one other complication. Typically when the supervisor switches from one process to another, it changes the segment and page tables. Absent an optimization, it would be necessary to flush the TLBs on any change in the tables, which is both costly in the cycles to flush and the misses that follow reloading the TLBs on memory references following the switch. Most processors with TLBs introduce a mechanism to reduce how often the TLB must be flushed, such as the Address Space Identifier (ASID) found in the MIPS translation hardware. The ASID is stored in the TLB, and when the supervisor switches to a new process, it either uses the process’ previous ASID, or assigns a new one if the TLB has been flushed since the last time the process ran. This allows its previous TLB entries to be used if they are still present in the TLB, but also avoids the TLB flush. When the ASIDs are used up, the TLB is flushed, and then ASID assignment starts fresh as processes are run. For example, a 5‑bit ASID would then require a TLB flush only when the 33rd distinct process is run after the last flush. The supervisor often uses translation and paging for its own data structures, some of which are process-specific, and some of which are common. To not require multiple TLB entries for the supervisor pages common between processes, a Global bit was introduced in the MIPS and other TLBs. This bit caused the TLB entry to ignore the ASID during the match process; such entries match any ASID. This whole issue occurs a second time when hypervisors switch between multiple guest operating systems, each of which thinks it controls the ASIDs in the TLB. RISC‑V for example introduced a VMID controlled by the hypervisor that works analogously to the ASID.
SecureRISC0 needs an ASID mechanism, and a way to bypass it for shared entries, for the same reasons as other ISAs. The question is whether this mechanism needs to be generalized, just as rings are a generalization of supervisor and user mode. I propose one such possible generalization with eight possible sharing opportunities, but whether this is required may be reevaluated. Perhaps SecureRISC0 will revert to a simple Global bit or just ASID=0 to mean Global. There is no particular reason to choose eight. Below is the mechanism proposed. Again, we expect that various service levels in the system will have some segments common to all of the service levels that they support, and that these should require only a single TLB entry, but that other segments might change their translation for each supported service level.
The simplest implementation for a Segment Descriptor Table (SDT) is to have a single Segment Descriptor Table Pointer (SDTP) register and use a Global bit in Page Table Entries (PTEs). My alternative ASID generalization is to group segments into eight groups (SG), and give each group its own SDT, as addressed by eight SDTP registers. These eight registers are then the zero level table, followed by the chosen Segment Descriptor Table (the first level), followed by zero to four levels of page table. Since the registers are not in memory, there are one to five levels of memory tables to walk starting with the Segment Descriptor Table. The segment size in the SDT allows the length of the walk to be per-segment, so most code segments (e.g. shared libraries) will have only one level of page table, but a process with a segment for weather data might require two or three levels (and might use a large page size as well to minimize TLB misses). Some hypervisor segments might be direct-mapped, and require only the SDT level of mapping. In addition if the hypervisor is not paging the supervisor, it might direct map many supervisor segments.
Here are the details. After a TLB miss, the processor starts by using the 3 high bits of the segment field of the address to pick one of eight Segment Descriptor Table Pointer registers (sdtp[0] to sdtp[7]). The low 10 bits of the segment field are then an index into the table at the system virtual address in the selected register. The size field of the sdtp register is used to bounds check the low 10 bits of the segment number before indexing, which allows each portion of the Segment Descriptor Table to be 0, 256, 512, 768, or 1024 entries (4 KiB to 16 KiB in multiples of 4 KiB). A size field of 0 disables the segment group; otherwise, the check is that lvaddr57..56 < sdtp[lvaddr60..58]11..9. If the bound check succeeds, the doubleword Segment Descriptor Entry is read from (sdtp[lvaddr60..58]63..12 ∥ 012) | (lvaddr57..48 ∥ 04), and this descriptor is used to bounds check the segment offset and to generate a system virtual address. When TLB entries are created to speed future translations, they use the Address Space Identifier (ASID) specified in bits 8..0 of the selected sdtp.
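This first step of the walk can be sketched as follows (a minimal sketch in Python, used purely for illustration; the dict representation of the sdtp registers is an assumption, not part of the spec):

```python
def sde_address(lvaddr, sdtp):
    """Locate the Segment Descriptor Entry after a TLB miss.

    sdtp models the eight SDTP registers as dicts with 'base' (the
    svaddress, low 12 bits zero) and 'size' (0 disables the group;
    otherwise the table holds size*256 entries).
    """
    sg = (lvaddr >> 58) & 0x7          # 3 high bits pick sdtp[0..7]
    seg = (lvaddr >> 48) & 0x3FF       # low 10 bits index within the group
    reg = sdtp[sg]
    # Bound check: the high 2 bits of the index against the 3-bit size.
    if reg["size"] == 0 or (seg >> 8) >= reg["size"]:
        raise MemoryError("segment bound check failed")
    # Each SDE is a doubleword (16 bytes), hence the << 4.
    return (reg["base"] & ~0xFFF) | (seg << 4)

# Example: segment group 1, segment index 5, a 256-entry table at 0x1000.
sdtp = [{"base": 0, "size": 0}] * 8
sdtp[1] = {"base": 0x1000, "size": 1}
addr = sde_address((1 << 58) | (5 << 48), sdtp)
# addr == 0x1000 | (5 << 4) == 0x1050
```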
This method can provide the functionality of the two levels found in other architectures (i.e. supervisor common using Global=1 and per-process using Global=0). A SecureRISC0 supervisor might simply use 256–1024 segments for the supervisor common mappings (with ASID=0) and 256–1024 segments for per-process mappings, with ASIDs dynamically assigned as processes are run. Such a system might set sdtp[7] at initialization, change sdtp[0] on process switch, and leave the other six groups unused (size=0).
Segment Descriptor Table Pointer registers are only readable and writable by ring 6. Other rings must use ring 6 calls to read and write these registers.
| 71..64 | 63..12 | 11..9 | 8..0 |
|---|---|---|---|
| 240 | svaddress63..12 | size | ASID |
| 8 | 52 | 3 | 9 |
| 71..64 | 63..33 | 32..31 | 30..29 | 28..24 | 23..21 | 20..18 | 17..15 | 14..12 | 11 | 10 | 9 | 8 | 7 | 6 | 5..0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 240 | 0 | G1 | G0 | PS | 0 | R3 | R2 | R1 | 0 | C | P | X | W | R | size |
| 8 | 31 | 2 | 2 | 5 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 6 |
Field(s) | Width | Description
---|---|---
size | 6 | Segment size is 2^size bytes. Value 0 indicates an invalid segment (or should there be a V bit?). Values 1..11 and 62..63 are reserved.
R | 1 | Read permission
W | 1 | Write permission
X | 1 | Execute permission
P | 1 | Pointer permission (pointers with segment numbers are permitted)
C | 1 | CHERI Capability permission
R1, R2, R3 | 3 | Ring brackets as described elsewhere.
PS | 5 | Page size is 2^(PS+12) bytes; the value 31 indicates a direct-mapped segment (no paging).
G0 | 2 | Generation number of this segment for GC.
G1 | 2 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this segment traps, and software lowers the G1 field. This feature is turned off by setting G1 to 3.
| 71..64 | 63..12 | 11..0 |
|---|---|---|
| 240 | svaddress63..12 | 0 |
| 8 | 52 | 12 |
The interpretation of the system virtual address in SDE word 1 depends on the PS field in SDE word 0. For direct mapping (PS=31), it is simply the high bits of the System Virtual Address to combine with the segment offset, and bits size-1..0 must be zero. For paging, it is the address of the first-level page table, and bits PS+11..0 must be zero.
For direct mapping, the translation works as follows. For segments ≤ 2^48 bytes, the offset is simply bits 47..0 of the local virtual address, and so the first check is that bits 47..size are zero, or equivalently that lvaddr47..0 < 2^size. For segments > 2^48 bytes, the offset extends into the segment number field, and no checking need be done during mapping (such sizes are however used during checking of address arithmetic), but bits 60..size must be cleared before ORing. The second check is that bits size−1..0 of the mapping are zero. The supervisor is responsible for providing the appropriate values in the Segment Descriptor Entries for each portion of segments > 2^48 bytes. Thus paging does not need to handle segments larger than 2^48 bytes (the SDT for such segments is in effect the first level of the page table).
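The direct-mapping checks for segments ≤ 2^48 bytes can be sketched as (an illustrative sketch; the function and parameter names are mine, not part of the spec):

```python
def direct_map(lvaddr, sde_base, size):
    """Direct-mapped translation for a segment of 2**size bytes (size <= 48).

    sde_base is the system virtual address from SDE word 1; the segment
    mapping must be aligned to the segment size.
    """
    offset = lvaddr & ((1 << 48) - 1)   # offset is bits 47..0 of the lvaddr
    if offset >> size:                  # first check: bits 47..size are zero
        raise MemoryError("segment offset out of bounds")
    # Second check: bits size-1..0 of the mapping are zero.
    assert sde_base & ((1 << size) - 1) == 0, "misaligned segment mapping"
    return sde_base | offset

# A 1 MiB segment (size=20) mapped at system virtual address 0x4_0000_0000:
# offset 0x1234 translates to 0x4_0000_1234.
```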
When paging is used, the page tables can be one to four levels deep, with each level after the first using the same page size. The first level uses some fraction of the specified page size, depending on the segment size. The hypervisor and supervisors are free to allocate only as much memory as needed for the first-level page table. The following diagrams show how the local virtual address is used to index the levels of the page table for several page and segment sizes. In the diagrams, the 13‑bit segment number is split into a 3‑bit segment group (SG) number (used to pick the SDTP register) and the 10‑bit index (SEG) within that group.
4 KiB pages, 2^48-byte segment (four levels):

| 60..58 | 57..48 | 47..39 | 38..30 | 29..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|---|
| SG | SEG | V1 | V2 | V3 | V4 | offset |
| 3 | 10 | 9 | 9 | 9 | 9 | 12 |
4 KiB pages, 2^30-byte segment (two levels):

| 60..58 | 57..48 | 47..30 | 29..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|
| SG | SEG | 0 | V1 | V2 | offset |
| 3 | 10 | 18 | 9 | 9 | 12 |
16 KiB pages, 2^48-byte segment (four levels):

| 60..58 | 57..48 | 47 | 46..36 | 35..25 | 24..14 | 13..0 |
|---|---|---|---|---|---|---|
| SG | SEG | V1 | V2 | V3 | V4 | offset |
| 3 | 10 | 1 | 11 | 11 | 11 | 14 |
1 MiB pages, 2^48-byte segment (two levels):

| 60..58 | 57..48 | 47..37 | 36..20 | 19..0 |
|---|---|---|---|---|
| SG | SEG | V1 | V2 | offset |
| 3 | 10 | 11 | 17 | 20 |
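The index widths in the diagrams above follow from the page size: a page of 2^(PS+12) bytes holds 2^(PS+9) 8‑byte PTEs, so each level after the first consumes PS+9 index bits, and the first level covers whatever the segment size leaves over. A small illustrative helper (not part of the spec) makes the derivation checkable:

```python
def walk_indices(seg_size_log2, page_size_log2, levels):
    """Split a segment offset into per-level page-table index widths.

    Returns (first_level_bits, other_level_bits, offset_bits); each
    level after the first indexes a full page of 8-byte PTEs.
    """
    offset_bits = page_size_log2
    index_bits = page_size_log2 - 3          # 8-byte PTEs per page
    first = seg_size_log2 - offset_bits - (levels - 1) * index_bits
    assert first > 0, "too many levels for this segment size"
    return first, index_bits, offset_bits

# 4 KiB pages, 2**48-byte segment, 4 levels -> (9, 9, 12), matching the
# first diagram; 16 KiB pages, 4 levels -> (1, 11, 14), matching the third.
```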
A segment page table consists of multiple levels, each level an array of 72‑bit words with integer tags, in the following format:

| 71..64 | 63..12 | 11..8 | 7..6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| 240 | svaddress63..12 | SW | G | D | A | X | W | R | V |
| 8 | 52 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
Segments are meant as the unit of access control, but including Read, Write, and Execute permissions in the PTE might make ports of less aware operating systems easier.
Field(s) | Width | Description
---|---|---
V | 1 | Valid: 0 ⇒ invalid, bits 63..1 available for software; 1 ⇒ valid, bits 63..1 as described below
R | 1 | Read permission
W | 1 | Write permission
X | 1 | Execute permission
A | 1 | Accessed: 0 ⇒ trap on any access (software sets A to continue); 1 ⇒ access allowed
D | 1 | Dirty: 0 ⇒ trap on any write (software sets D to continue); 1 ⇒ writes allowed
G | 2 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this page traps, and software lowers the G field. This feature is turned off by setting G to 3.
SW | 4 | For software use
svaddress63..12 | 52 | For the last level of the page table, this is the translation; for earlier levels, this is the pointer to the next level
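As an illustration of the PTE layout above, here is a sketch of unpacking the data bits of a PTE (field names follow the table; the dict representation and helper name are mine, not part of the spec):

```python
def decode_pte(word):
    """Unpack the 64 data bits of a PTE (the 240 tag is checked separately)."""
    return {
        "V":   word & 1,
        "R":  (word >> 1) & 1,
        "W":  (word >> 2) & 1,
        "X":  (word >> 3) & 1,
        "A":  (word >> 4) & 1,
        "D":  (word >> 5) & 1,
        "G":  (word >> 6) & 0x3,
        "SW": (word >> 8) & 0xF,
        "svaddress": word & ~0xFFF,   # bits 63..12: translation or next level
    }

# Valid, readable, accessed, dirty page at svaddress 0xABCDE000, GC checks off:
pte = decode_pte((0xABCDE << 12) | (3 << 6) | 0b110011)
```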
After 61‑bit Local Virtual Addresses (lvaddrs) are mapped to 64‑bit System Virtual Addresses (svaddrs), these svaddrs are mapped to 64‑bit System Interconnect Addresses (siaddrs). This mapping is similar, but not identical, to the mapping above, as it starts with a 14‑bit region number rather than one of eight 13‑bit segment numbers. There is one such mapping for the entire system, set by the hypervisor, using a Region Descriptor Table (RDT) at a fixed system address. The RDT may be hardwired, read-only, or read/write by the hypervisor. For the maximum 16,384 regions, with 16 bytes per RDT entry, the maximum RDT size is 256 KiB. A system configuration parameter allows the size of the RDT to be reduced when the full number of regions is not required (which is likely).
The format of the Region Descriptor Entries is a simplified version of Segment Descriptor Entries as shown below.
| 71..64 | 63..26 | 25..14 | 13 | 12 | 11 | 10..6 | 5..0 |
|---|---|---|---|---|---|---|---|
| 240 | 0 | NDA | C | W | R | PS | size |
| 8 | 38 | 12 | 1 | 1 | 1 | 5 | 6 |
| 71..64 | 63..4 | 3..0 |
|---|---|---|
| 240 | System Interconnect Address | 0 |
| 8 | 60 | 4 |
The format of a region page table is multiple levels, each level consisting of 72‑bit words with integer tags in the same format as PTEs for local virtual to system virtual mapping, except there are no X or G fields.
Additional fields in the RDE may be useful. A bit indicating whether memory is tagged or not may be useful (tag-aware ports would provide a 240 tag on reads and check that the tag is 240 on writes). Another field might indicate whether encryption is used for the region, and if so, which of the port’s keys to use.
Reading Segment Descriptor Entries (SDEs) from the Segment Descriptor Table (SDT) and Region Descriptor Entries (RDEs) from the Region Descriptor Table (RDT) would typically be done through the L2 Data Cache. Since the L2 Data Cache is coherent with this and other processors in the system, the L2 Data Cache might note that the TLB contains entries from a line and send an invalidate to the TLB when that L2 line is invalidated. This might avoid the need for some TLB flushes. However, it requires the L2 to store the TLB location, which might require 8 bits per L2 tag. It is unclear whether this is worthwhile.
Ports into the system interconnect (Initiators) are limited in which regions they are permitted to access. The exact mechanism is TBD.
One possibility is that each Initiator is programmed by the hypervisor with two non-discretionary access control (NDAC) sets. One is for the Initiator’s TLB accesses, and the other is for accesses made by agents that the Initiator services. Non-discretionary access control is also stored as part of the Region Descriptors and cached in the Initiator’s TLB. The Initiator tests each access and rejects those that fail. Read access requires RegionCategories ⊆ InitiatorCategories and Write access requires RegionCategories = InitiatorCategories. For example, the Region Descriptor Table and the page tables those reference might have a Hypervisor bit that would prevent reads and writes from anything but Initiator TLBs. Processors would have non-discretionary access control sets per-ring. This would allow the same system to support multiple classification levels, e.g. Orange Book Secret and Top-Secret, with Top-Secret peripherals able to read both Secret and Top-Secret memory, but Secret peripherals denied access to Top-Secret memory.
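The category-set tests described above can be sketched as follows (modeling NDAC sets as sets of labels; purely illustrative, not part of the spec):

```python
def ndac_allows(initiator, region, write):
    """Non-discretionary access control check as described above.

    Read access requires RegionCategories to be a subset of
    InitiatorCategories; write access requires equality.
    """
    if write:
        return region == initiator
    return region <= initiator   # <= is the subset test on frozensets

top_secret = frozenset({"secret", "top-secret"})
secret = frozenset({"secret"})
# A Top-Secret initiator may read Secret memory, but a Secret initiator
# may not read Top-Secret memory, and writes require an exact match.
```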
Encryption might also be used to protect multiple levels of data in a system. For example, if Secret and Top-Secret data in memory are encrypted with different keys, and Secret Initiators are only programmed with that encryption key, then reading Top-Secret memory will result in garbage being read and writing Top-Secret data from a peripheral to Secret memory will result in that data being garbage to a processor or another peripheral with only the Secret key.
Because encryption merely renders data unintelligible rather than denying the access, encryption-only protection is more difficult to debug. It may be desirable to employ both NDAC sets and encryption.
An optional system feature of RDEs is to specify that the contents of the memory of the region should be protected with data at rest encryption. A separate table (perhaps in a secure enclave) would give the symmetric encryption key for encrypting and decrypting data transferred to and from the region and the system interconnect address would be used as the tweak. An obvious possibility is a 144‑bit block size cipher (e.g. a variant of AES based upon 9‑bit S‑boxes) used in Galois/Counter Mode (GCM), resulting in a 144‑bit authentication code, which would be stored in memory with the block. For SecureRISC0, with a cache line size of 8 words of 72 bits, this results in 576‑bit entities for data at rest protection, which becomes 720 bits with the authentication code, or eight 90‑bit words, which would be ECC protected with 8 check bits, producing a 98‑bit memory word. This would be an unusual width for standard DRAMs, and 9 ECC bits per 180 would also be unusual. Instead consider the 576 bits to be 9 words of 64 bits, use a more standard 128‑bit block cipher (e.g. standard AES) nine times, add the 128‑bit authentication, resulting in 704 bits, or eight words of 88 bits. Adding 8 bits of ECC results in 96 bits per memory word, which might use three 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 GCM xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would be much less time than the ECC check. Regenerating ECC for the decrypted data for writing into the L2 cache can be done by also precomputing the 64 bits to xor with the 8 ECC codes. Only if an ECC error is detected and corrected is it necessary to recompute the ECC before writing into the L2 cache. Writes would incur the GCM computation latency (primarily nine AES computations). 
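The bit budgets in the paragraph above can be checked mechanically (a quick arithmetic restatement of the two layouts described, not new design):

```python
# Cache line: 8 tagged words of 72 bits.
line_bits = 8 * 72
assert line_bits == 576

# Layout 1: 144-bit block cipher (9-bit S-box AES variant) + 144-bit GCM tag.
protected = line_bits + 144            # 720 bits per protected line
word1 = protected // 8                 # spread over eight memory words
assert (word1, word1 + 8) == (90, 98)  # 90 bits + 8 ECC bits = unusual 98

# Layout 2: treat the 576 bits as 9 x 64, standard AES + 128-bit GCM tag.
protected = 9 * 64 + 128               # 704 bits per protected line
word2 = protected // 8                 # spread over eight memory words
assert word2 + 8 == 96                 # 88 bits + 8 ECC bits = 96 per word
```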
Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to support non-block writes (it being impossible to update the authentication code without doing a read-modify-write).
| Structure | Description |
|---|---|
| Basic Block Descriptor Fetch | |
| Predicted PC | 64‑bit lvaddr and ring |
| Predicted COUNT | 64‑bit integer |
| Predicted CSP | 64‑bit lvaddr and ring |
| L1 BB Descriptor TLB | 32 entry, 8‑way set associative, mapping lvaddr61..12 to siaddr63..12 in parallel with BB Descriptor Cache, filled from L2 Descriptor/Instruction TLB |
| L2 BB Descriptor TLB | 256 entry, 8‑way set associative, filled from L2 Data Cache |
| BB Descriptor Cache | 32 KiB (4096 descriptors), 8‑way set associative, 64‑byte line size, 8‑byte read, 64‑byte write, lvaddr11..3 index, ?siaddr35..12 tag?, 1.5 cycles latency, 2 cycles to predicted PC, filled from L2 Descriptor/Instruction Cache on miss and by prefetch |
| Next Descriptor Index Predictor | 32×10+12, direct mapped, lvaddr7..3 index, lvaddr19..8 tag, 1 cycle to predicted BB Descriptor Cache index, most recent flow change hits from BB Descriptor Cache |
| Return Address Prediction | 64‑entry (512 B) |
| Branch Predictor | ~16 KiB BATAGE |
| Indirect Jump/Call Predictor | ~16 KiB ITTAGE? |
| BB Fetch Output | 8‑entry BB Descriptor Queue of PC, BB type, fetch count, fetch siaddr63..2, instruction start mask, prediction to check |
| Instruction Fetch | |
| L1 Instruction Cache | 64 KiB, 4‑way set associative, 64‑byte line, read, write, siaddr13..4 index, siaddr63..14 tag, 2‑cycle latency, used 0–2 times per basic block descriptor (0 fetches required when the previous 64 B fetch covers the current one), so 0 or 2–3 cycles for the entire BB instruction fetch, filled from L2 Descriptor/Instruction Cache on miss and prefetch; experiment with prefetch on BB descriptor fill |
| L2 Fetch | |
| L2 Combined Descriptor/Instruction Cache | 512 KiB, 8‑way set associative, 64‑byte line, read, write, siaddr15..6 index, siaddr63..16 tag, filled from system interconnect or L3 on miss and prefetch, evictions to L3 |
| Instruction Fetch Output | 32‑entry Instruction Queue of 50‑bit decoded instructions (16‑bit and 32‑bit instructions expanded) |
| AR Execution Unit | |
| PC, CSP, COUNT | Committed values |
| Register renaming for ARs | 16×6 4‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical AR numbers and assigning d from the AR free list |
| Register renaming for XRs | 16×6 8‑read, 4‑write register file mapping 4‑bit a, b fields to physical XR numbers and assigning d from the XR free list |
| Register renaming for BRs | 16×6 6‑read, 2‑write register file mapping 4‑bit a, b, c fields to physical BR numbers and assigning d from the BR free list |
| Register renaming for SRs | 16×6 8‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical SR numbers and assigning d from the SR free list |
| (VRs are not renamed) | |
| AR physical register file | 64×144 (+ parity) 6‑read, 4‑write |
| XR physical register file | 64×72 (+ parity) 6‑read, 4‑write |
| L1 Data TLB | 64 entry, 8‑way set associative, mapping lvaddr to siaddr, filled from L2 Data TLB |
| L2 Data TLB | 256 entry, 8‑way set associative, filled from L2 Data Cache |
| L1 Data Cache | 32 KiB, 4‑way set associative, 64‑byte line, 16‑byte read, 64‑byte write, lvaddr index, siaddr tag, write-thru, filled from L2 Data Cache on miss or prefetch |
| Return Address Stack Cache | 64‑entry (512 B), 64‑byte line size, no tags, fill and writeback to L2 Data Cache, subset of and coherent with L2 Data Cache |
| L2 Data Cache | 512 KiB, 8‑way set associative, 64‑byte line, read, write, siaddr15..6 index, siaddr63..16 tag, write-back, filled from system interconnect or L3 on miss or prefetch, eviction to L3 |
| AR Engine Output | 64‑entry BR/SR/VR operation queue |
| SR/VR Execution Unit (tends to run about an L2 Data Cache latency behind the AR Execution Unit) | |
| BR physical register file | 64×1 6‑read, 2‑write |
| SR physical register file | 64×72 (+ parity) 8‑read, 4‑write |
| VR register file | 16×72×128 (+ parity) 4‑read, 2‑write |
| Combined Fetch/Data | |
| System virtual address TLB | 128 entry, 8‑way set associative, mapping system virtual addresses to system interconnect addresses (maintained by the hypervisor) |
| L3 Eviction Cache | 8 MiB, 8‑way set associative, 64‑byte line size, non-inclusive, serving the L2 Instruction and L2 Data caches, plus an 8‑way set associative directory for sub-caches, filled from evictions from the L2 Instruction and Data caches |
Tag | Use |
---|---|
0 | Null pointers |
1..128 | Pointer to 1..128 words |
129..135 | Pointer to 1..7 bytes |
136 | Pointer to N words, with N stored at pointer − 8, and −N stored at pointer + N×8 |
137 | Unsized Pointer for C++ |
138..191 | Reserved |
192 | Pointer to Basic Block Descriptor |
193..199 | Reserved |
200 | CHERI Capability word 0 |
201..223 | Reserved |
224 | Lisp CONS |
225 | Lisp Function |
226 | Lisp Symbol |
227 | Lisp/Julia Structure |
228..229 | Reserved |
230 | Lisp Array |
231 | Lisp Vector |
232 | Lisp String |
233 | Lisp Bit-vector |
234 | Lisp Ratio, Julia Rational |
235 | Lisp/Julia Complex |
236 | Lisp Bigfloat |
237 | Lisp Bignum |
238 | 128‑bit integer |
239 | 128‑bit unsigned integer |
240 | 64‑bit integer |
241 | 64‑bit unsigned integer |
242 | Small integer types |
243 | Reserved |
244 | Double-precision floating-point |
245 | 8, 16, and 32‑bit floating-point |
246..251 | Reserved |
252 | CHERI capability word 1. Bits 143..136 of AR doubleword store (used for save/restore and CHERI capabilities)
253 | Basic Block Descriptor |
254 | Size header/trailer words |
255 | Trap on load or store |
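As an illustration of the size-encoding tags (0, 1..128, 129..135, and 136) in the table above, here is a hedged sketch of recovering an object's extent from its pointer tag (the helper names and memory-access callback are mine, not part of the spec):

```python
def pointer_extent_bytes(tag, load_word_at=None, ptr=None):
    """Object extent in bytes implied by a pointer tag, per the table above.

    load_word_at and ptr are only needed for tag 136, where the word
    count N is stored in memory at ptr - 8.
    """
    if tag == 0:
        return 0                           # null pointer
    if 1 <= tag <= 128:
        return tag * 8                     # pointer to 1..128 words
    if 129 <= tag <= 135:
        return tag - 128                   # pointer to 1..7 bytes
    if tag == 136:
        return load_word_at(ptr - 8) * 8   # N words, N stored before object
    raise ValueError("tag does not encode a size")

# Tag 128 -> 128 words (1024 bytes); tag 133 -> 5 bytes.
```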
Earl Killian <webmaster at securerisc.org> |
SecureRISC0/index.html 2022-08-16 |