Proposal for Embedded Sandboxing

Version 0.1-draft-20231229

Table of Contents

Introduction

A Sandbox[wikilink] is an environment for executing untrusted programs in a manner that prevents them from corrupting the invoking environment. A sandbox might for example be used to execute code downloaded from an untrusted website, and compiled with a JIT compiler. Th untrusted code needs to make calls to its environment for services required, but should not be able to otherwise access that environment.

The basic idea for this proposal is to implement a single nested protection domain[wikilink] within the user-mode (or unprivileged if no user-mode) address space. Because this proposal is expected to function on processors both with and without virtual memory, it does not depend upon the RISC‑V virtual memory mechanisms, even though that would be a more general method for nested protection (e.g. extending the Sv39 PTE U-bit).

This proposal shares a similar mechanism to the VAkeys proposal, and if pursued, should be unified with that.

Pros

Cons

Proposal

The sandboxed application may have 3-4 separate regions of the address space that it shares with the main application:

The goal is the sandboxed application would have access only to these regions, and no other, and that any attempt on its part to go beyond these will exit to the main application. To implement this, the following four or five unprivileged CSRs are proposed:

Embedded Sandboxing CSRs
CSR Purpose
sbox0 Sandbox Code Matching and Permission
sbox1 Sandbox Heap Matching and Permission
sbox2 Sandbox Stack Matching and Permission
sbox3 Sandbox Library Matching and Permission (optional)
sboxtrap Sandbox Trap Handler

The format of sbox0 to sbox3 is somewhat simplified but similar to the Machine mode PMP registers, but matches on the virtual address rather than physical. If the PC matches one of the sbox registers with execute access, then the program is executing in sandbox mode. Thus the main program can transfer to sandbox mode simply with a branch, jump, call, etc. The supervisor can return to sandbox mode with a SRET. If the PC does not so match, then it is executing the main application program. When in sandbox mode, any attempt to fetch from an address not matching a sbox register with execute access, or executing a permitted instruction, instead transfers to the virtual address held in the sboxtrap register, and exits sandbox mode. Typically such an exit would be a procedure call to request services of the main application, and the a return will naturally resume sandboxed operation. (The return address and parameters would of course have to be checked by the handler in the sboxtrap register for validity.) Also in sandbox mode, any load or store to a virtual address not matched by one of the sbox registers with the appropriate read permission (for a load) or write permission (for a store) will transfer to the sboxtrap address. How to distinguish calls and access violations is TBD, but probably does not merit a full exception PC and cause code. Any exit from the sandbox other than a call is likely to have the sandbox killed, and calls have a return address in x1. Perhaps add an error bit to each sbox to indicate the error occurred there, or use the low bits of sboxtrap, but anything fancier is not warranted.

Most basic RISC-V unprivileged instructions are permitted in sandbox mode, but not ECALL, EBREAK, CSR*, or similar instructions.

For user-mode sandboxing (as opposed to non-virtualized ones), virtual memory exceptions (e.g. page faults and access faults) continue to trap to the supervisor, which does not need to know about the sandbox. When both an page fault or access fault and a sandbox trap could occur, the sandbox trap has precedence.

The format of sboxtrap is simply a 4‑byte aligned virtual address:

Sandbox Trap Handler
63 2 1 0
Virtual Address 0
62 2

The format of sbox0 to sbox3 is as follows:

Sandbox Match and Permission
63 11 10 4 3 1 0
vaddr63..12+RS 2RS 0 XWR V
52−RS 1+RS 7 3 1
Fields of sbox
Field Width Bits Description
V 1 0 Valid
0 ⇒ Matching disabled
1 ⇒ Valid, bits 63:1 interpreted as follows
XWR 3 3:1 Execute, Write, and Read permissions as in Sv39 PTEs. Applies to instruction fetch, loads, and stores in sandbox mode. Ignored when not in sandbox mode.
0 7 10:4 Reserved for future use.
2RS 1+RS 11+RS:11 NAPOT encoding of region size. The region is 2RS+12 bytes.
vaddr63:12+RS 52−RS 63:12+RS The virtual address bits to match on valid entries in sandbox mode.

A state enable bit would have to be added for these registers to be kept clear on operating systems that don’t context switch these registers. If this proposal proceeds, details will follow.

When all sbox register valid bits are 0, the entire extension is disabled and has no effect on unprivileged operation.

Matching against sbox registers may be done in parallel without priority; the operation in case of multiple match is reserved, and software should avoid such a situation.

Implementation Notes

To illustrate how this might be implemented (without constraining the implementation), consider that in parallel with the L1 instruction cache access the PC is compared against the subset of sbox registers that are valid and have Execute access. The sbox registers act as sort of micro-TLB, with the only result being the XWR and present indications. A match results in setting the sandbox bit for the fetched instruction(s) and this bit is pipelined along with the instruction(s) and becomes a term in the illegal instruction exception logic (along with various misa bits). Further down the pipeline, branches, jumps, loads and stores also consult this micro-TLB on the effective address when the sandbox bit is set, and a transfer to sboxtrap results if either there is no match, or if on a match, Execute for a branch or jump, Read (for a load) or Write (for a store) permission is not granted. PC increments also test whether the high 52−RS bits of the PC remain unchanged, and if not this also results in PC ← sboxtrap.

Design Considerations

Number of sbox Registers

This proposal calls for separate heap and stack regions simply so that stack overflow can be detected. If this is not required, one register can be eliminated.

The library sbox provides support to the sandboxed application that runs in sandbox mode, but which might be a standard DLL mapped into the address space that is separate from the sandboxed application code. If this is not required, one register can be eliminated.

Support for Garbage Collection

It is possible that the sbox registers could be generalized to serve as generational GC support. If this is interesting, then this proposal might be extended.

Permitted Operations

If required, a few CSR reads and writes, e.g. to floating-point CSRs, might be allowed, but it would be simpler to just avoid this in sandbox code.

Limits to Simplify Hardware

Some limit on the value of RS might be appropriate. Would RS ≤ 20 (i.e. limiting regions to 232 bytes) be sufficient?

Dirty Bit for CSR Writes

It may be that a dirty bit should be set on writes to the sbox and sboxtrap registers. If the dirty bit is clear, the operating system would not need to save these values on context switch.

Garbage Collection

Garbage collection (GC) is the automatic reclaimation of allocated memory by tracing all reachable allocations and freeing the remainder. Allocation is done in areas, which are for user-visible segregation of different-purpose allocations to different portions of memory. Areas consist of 1-4 generations, each generation consisting of a number of some data structures and regions that are used to implement incremental GC. The purpose of regions is to bound the timing of certain GC operations making program delays not proportional to memory size.

Generational GC

A similar mechanism of virtual address NAPOT-matching CSRs can be used to support efficient garbage collection. Imagine two, three, or four CSRs that match a region of the virtual address space, corresponding to generations. Possibly two generations is sufficient, but could be extended to as three or four if that proves necessary. Quoting Wikipedia, It has been empirically observed that in many programs, the most recently created objects are also those most likely to become unreachable quickly (known as infant mortality or the generational hypothesis). This is exploited by allocating in an emphemeral area, collecting that frequently, and things that persist there are eventually migrated to areas collected less frequently. When collecting, one starts with the roots and traces all reachable data, either making the non-reachable locations available for subsequent reuse or (better) compacting the reachable data to eliminate fragmentation. With generations, the start of tracing for a given generation is all pointers from older generations, which requires some way that such pointers be recorded. One method is a write barrier to detect when a pointer to a younger generation is stored to an older generation, so that this pointer can be noted as a root when tracing the younger generation. Alternatively, it may be possible to simply note what older generation blocks contain pointers to newer ones, and such blocks are scanned when tracing begins (the tradeoff being the cost of noting pointer by pointer vs. scanning a whole block). Once a block is noted, subsequent stores to that block need not be noted. Detection might be via an exception, where the handler records the pointer from older to newer, or it might be via a branch. The branch significantly increases code size and execution time in the usual case, whereas the exception is typically more costly in execution time in the uncommon case.

As an example, consider then adding two CSRs, GC0 and GC1, where GC0 matches the most recent allocation area and GC1 matches the second youngest allocation area. What is needed is a test, triggering either a branch or an exception, when storing a value that matches GC0 to an address not matching GC0, or when storing a value that matches GC1 to an address not matching GC0 or GC1. This could be a GCCHK instruction that computes a 0 or 1 result given a value to be stored and an address. This can be branched on prior to the store or after, depending on the algorithm. To save one instruction, a branch instruction given the value and address could be defined. For detection by exception, this could be implemented with a SG instruction that is similar to SD, but which takes an exception based on the above. If more than two generations are required, adding GC2 and perhaps GC3 is feasible, with the obvious generalization. The primary issue with SG is how to handle the exception efficiently in user mode.

Concurrent GC

The simplest GC algorithms are sometimes called stop the world because when the allocation area fills up, the program is suspended while garbage collection reclaims memory, and when GC completes, the program resumes. This may introduce unacceptable delays, and so most GC today is required to be incremental or real-time. In this case, when the allocation area is reaches some limit, GC tracing begins while the program continues running and continues allocating (new allocation is marked as reachable). When tracing completes, the next phase depends upon the GC algorithm, but for example a compacting GC will want to move objects to new addresses. One way to do this without significantly delaying continued program execution is to have two virtual addresses for each physical page that differ by a single bit and access is permitted or denied based on this bit and the status of the compaction for a region. All of the root pointers (e.g. the stack) are replaced with the address where the data will eventually live after compaction, but with this bit toggled, but access to these virtual addresses is disabled. From now on the running application will see only compacted addresses, but if it attempted to read from such an address before the relocation occurs, this is detected and the GC relocates that small region of memory so that program execution can continue (doing it early). This is the incremental aspect of such a garbage collector; the program runs slower during this phase because each reference may trigger some relocation, but the slowdown is bounded, and not proportional to memory size (only the fixed region size). The program only waits for a fixed amount of work to be done for the things it references, not all of memory. The reason for two virtual addresses mapping to the same physical memory is so that the application is prevented from referencing uncompacted memory, but the GC can access it with the bit toggled.

The permission to reference a region can be implemented solely with page permissions, but it may somewhat expensive to change those, and in that case it is possible to implement the permission with a virtual address check. Page permissions are more scalable, because virtual address checks are likely to have limited bits for specification, but to illustrate pick a bit of the virtual address that allows mapping two virtual addresses to the same physical address (either due to duplicating the translation or because it is ignored by the processor on translation). This bit could be checked independently to permit or deny an access based on the contents of a CSR as an alternative to changing page permissions. For example, for 64 regions of 16384 pointers (131072 bytes), bits 22:17 could be used to select one bit from a CSR giving the permitted value of the selected bit for an access to go ahead. Changing the status of a region would simply involve toggling a bit in the CSR. Again, the permission test could be implemented as an instruction that returns a 0 or 1 value, a branch instruction, or a LG instruction that take an exception. There might be 1-4 CSRs (representing 64-256 regions) of permissions per generation. The issue with the CSR approach is the limitation on generation size; it is not scalable.

The primary issue with any GC acceleration is that it will be dependent on a fast exception mechanism, preferably one that is handled in user-mode, and that becomes expensive in terms of CSRs.

Glossary

GC
An acronym for Garbage Collection[wikilink], which in a computer refers to having algorithms reclaim memory allocated and no longer used, rather than having explicit free operations that introduce the potential to continue using the memory after the free.
Generational GC
Generational Garbage Collection[wikilink] separates data by age so that older generations, which typically have fewer references to recent recent generations, need not be fully scanned to determine which recent allocations may be reclaimed.
JIT
An acronym for Just-In-Time[wikilink] compilation.
NAPOT
An acronym for Naturally Aligned Power-of-2[PDF] (Chapter 5 “Svnapot” Standard Extension for NAPOT Translation Contiguity).
NAPOT refers to things of size 2N bytes be aligned to that size (i.e. the address is a multiple of 2N, viz. the least significant N bits of the address are zero). In RISC‑V NAPOT sizes are represented with a repeated 1s (e.g. in PMP entries) or repeated 0s (e.g. in PTEs) in the lower address bits, then the opposite bit, and then from that point actual address bits, which allows the address and size to encoded in 1 bit more than the address alone.
PMP
An acronym for Physical Memory Protection[PDF] (section 3.7.1 Physical Memory Protection CSRs), which is a mechanism for providing Read, Write, and Execute permission for ranges of physical memory addresses independent of the translation mechanism, and which is logically anded with those permissions.

Valid XHTML 1.1 Valid CSS!
<webmaster at securerisc.org>
No Junk Email!
2023-12-29