Version 0.1-draft-20231229
A Sandbox is an environment for executing untrusted programs in a manner that prevents them from corrupting the invoking environment. A sandbox might for example be used to execute code downloaded from an untrusted website, and compiled with a JIT compiler. Th untrusted code needs to make calls to its environment for services required, but should not be able to otherwise access that environment.
The basic idea for this proposal is to implement a single nested protection domain within the user-mode (or unprivileged if no user-mode) address space. Because this proposal is expected to function on processors both with and without virtual memory, it does not depend upon the RISC‑V virtual memory mechanisms, even though that would be a more general method for nested protection (e.g. extending the Sv39 PTE U-bit).
This proposal shares a similar mechanism to the VAkeys proposal, and if pursued, should be unified with that.
The sandboxed application may have 3-4 separate regions of the address space that it shares with the main application:
The goal is the sandboxed application would have access only to these regions, and no other, and that any attempt on its part to go beyond these will exit to the main application. To implement this, the following four or five unprivileged CSRs are proposed:
CSR | Purpose |
---|---|
sbox0 | Sandbox Code Matching and Permission |
sbox1 | Sandbox Heap Matching and Permission |
sbox2 | Sandbox Stack Matching and Permission |
sbox3 | Sandbox Library Matching and Permission (optional) |
sboxtrap | Sandbox Trap Handler |
The format of sbox0 to sbox3 is somewhat simplified but similar to the Machine mode PMP registers, but matches on the virtual address rather than physical. If the PC matches one of the sbox registers with execute access, then the program is executing in sandbox mode. Thus the main program can transfer to sandbox mode simply with a branch, jump, call, etc. The supervisor can return to sandbox mode with a SRET. If the PC does not so match, then it is executing the main application program. When in sandbox mode, any attempt to fetch from an address not matching a sbox register with execute access, or executing a permitted instruction, instead transfers to the virtual address held in the sboxtrap register, and exits sandbox mode. Typically such an exit would be a procedure call to request services of the main application, and the a return will naturally resume sandboxed operation. (The return address and parameters would of course have to be checked by the handler in the sboxtrap register for validity.) Also in sandbox mode, any load or store to a virtual address not matched by one of the sbox registers with the appropriate read permission (for a load) or write permission (for a store) will transfer to the sboxtrap address. How to distinguish calls and access violations is TBD, but probably does not merit a full exception PC and cause code. Any exit from the sandbox other than a call is likely to have the sandbox killed, and calls have a return address in x1. Perhaps add an error bit to each sbox to indicate the error occurred there, or use the low bits of sboxtrap, but anything fancier is not warranted.
Most basic RISC-V unprivileged instructions are permitted in sandbox mode, but not ECALL, EBREAK, CSR*, or similar instructions.
For user-mode sandboxing (as opposed to non-virtualized ones), virtual memory exceptions (e.g. page faults and access faults) continue to trap to the supervisor, which does not need to know about the sandbox. When both an page fault or access fault and a sandbox trap could occur, the sandbox trap has precedence.
The format of sboxtrap is simply a 4‑byte aligned virtual address:
63 | 2 | 1 | 0 | |
Virtual Address | 0 | |||
62 | 2 |
The format of sbox0 to sbox3 is as follows:
63 | 11 | 10 | 4 | 3 | 1 | 0 | ||||||
vaddr63..12+RS | 2RS | 0 | XWR | V | ||||||||
52−RS | 1+RS | 7 | 3 | 1 |
Field | Width | Bits | Description |
---|---|---|---|
V | 1 | 0 |
Valid 0 ⇒ Matching disabled 1 ⇒ Valid, bits 63:1 interpreted as follows |
XWR | 3 | 3:1 | Execute, Write, and Read permissions as in Sv39 PTEs. Applies to instruction fetch, loads, and stores in sandbox mode. Ignored when not in sandbox mode. |
0 | 7 | 10:4 | Reserved for future use. |
2RS | 1+RS | 11+RS:11 | NAPOT encoding of region size. The region is 2RS+12 bytes. |
vaddr63:12+RS | 52−RS | 63:12+RS | The virtual address bits to match on valid entries in sandbox mode. |
A state enable bit would have to be added for these registers to be kept clear on operating systems that don’t context switch these registers. If this proposal proceeds, details will follow.
When all sbox register valid bits are 0, the entire extension is disabled and has no effect on unprivileged operation.
Matching against sbox registers may be done in parallel without priority; the operation in case of multiple match is reserved, and software should avoid such a situation.
To illustrate how this might be implemented (without constraining the
implementation), consider that in parallel with the L1 instruction cache
access the PC is compared against the subset
of sbox registers that are valid and have
Execute access. The sbox registers act as sort
of micro-TLB, with the only result being
the XWR and present indications. A match
results in setting the sandbox bit for the fetched instruction(s) and
this bit is pipelined along with the instruction(s) and becomes a term
in the illegal instruction exception logic (along with
various misa bits). Further down the pipeline,
branches, jumps, loads and stores also consult this micro-TLB
on
the effective address when the sandbox bit is set, and a transfer
to sboxtrap results if either there is no
match, or if on a match, Execute for a branch or jump, Read (for a load)
or Write (for a store) permission is not
granted. PC increments also test whether the
high 52−RS bits of
the PC remain unchanged, and if not this also
results in PC ← sboxtrap.
This proposal calls for separate heap and stack regions simply so that stack overflow can be detected. If this is not required, one register can be eliminated.
The library sbox provides support to the sandboxed application that runs in sandbox mode, but which might be a standard DLL mapped into the address space that is separate from the sandboxed application code. If this is not required, one register can be eliminated.
It is possible that the sbox registers could be generalized to serve as generational GC support. If this is interesting, then this proposal might be extended.
If required, a few CSR reads and writes, e.g. to floating-point CSRs, might be allowed, but it would be simpler to just avoid this in sandbox code.
Some limit on the value of RS might be appropriate. Would RS ≤ 20 (i.e. limiting regions to 232 bytes) be sufficient?
It may be that a dirty bit should be set on writes to the sbox and sboxtrap registers. If the dirty bit is clear, the operating system would not need to save these values on context switch.
Garbage collection (GC) is the automatic reclaimation of allocated memory by tracing all reachable allocations and freeing the remainder. Allocation is done in areas, which are for user-visible segregation of different-purpose allocations to different portions of memory. Areas consist of 1-4 generations, each generation consisting of a number of some data structures and regions that are used to implement incremental GC. The purpose of regions is to bound the timing of certain GC operations making program delays not proportional to memory size.
A similar mechanism of virtual address NAPOT-matching CSRs can be used
to support efficient garbage collection. Imagine two, three, or four
CSRs that match a region of the virtual address space, corresponding to
generations. Possibly two generations is sufficient, but could be
extended to as three or four if that proves necessary. Quoting
Wikipedia, It has been empirically observed that in many programs,
the most recently created objects are also those most likely to become
unreachable quickly (known as infant mortality or the generational
hypothesis).
This is exploited by allocating in an emphemeral area,
collecting that frequently, and things that persist there are eventually
migrated to areas collected less frequently. When collecting, one
starts with the roots and traces all reachable data, either making the
non-reachable locations available for subsequent reuse or (better)
compacting the reachable data to eliminate fragmentation. With
generations, the start of tracing for a given generation is all pointers
from older generations, which requires some way that such pointers be
recorded. One method is a write barrier
to detect when a pointer
to a younger generation is stored to an older generation, so that this
pointer can be noted as a root when tracing the younger generation.
Alternatively, it may be possible to simply note what older generation
blocks contain pointers to newer ones, and such blocks are scanned when
tracing begins (the tradeoff being the cost of noting pointer by pointer
vs. scanning a whole block). Once a block is noted, subsequent stores
to that block need not be noted. Detection might be via an exception,
where the handler records the pointer from older to newer, or it might
be via a branch. The branch significantly increases code size and
execution time in the usual case, whereas the exception is typically
more costly in execution time in the uncommon case.
As an example, consider then adding two CSRs, GC0 and GC1, where GC0 matches the most recent allocation area and GC1 matches the second youngest allocation area. What is needed is a test, triggering either a branch or an exception, when storing a value that matches GC0 to an address not matching GC0, or when storing a value that matches GC1 to an address not matching GC0 or GC1. This could be a GCCHK instruction that computes a 0 or 1 result given a value to be stored and an address. This can be branched on prior to the store or after, depending on the algorithm. To save one instruction, a branch instruction given the value and address could be defined. For detection by exception, this could be implemented with a SG instruction that is similar to SD, but which takes an exception based on the above. If more than two generations are required, adding GC2 and perhaps GC3 is feasible, with the obvious generalization. The primary issue with SG is how to handle the exception efficiently in user mode.
The simplest GC algorithms are sometimes called stop the world
because when the allocation area fills up, the program is suspended
while garbage collection reclaims memory, and when GC completes, the
program resumes. This may introduce unacceptable delays, and so most GC
today is required to be incremental
or real-time
. In this
case, when the allocation area is reaches some limit, GC tracing begins
while the program continues running and continues allocating (new
allocation is marked as reachable). When tracing completes, the next
phase depends upon the GC algorithm, but for example a compacting GC
will want to move objects to new addresses. One way to do this without
significantly delaying continued program execution is to have two
virtual addresses for each physical page that differ by a single bit and
access is permitted or denied based on this bit and the status of the
compaction for a region. All of the root pointers (e.g. the stack) are
replaced with the address where the data will eventually live after
compaction, but with this bit toggled, but access to these virtual
addresses is disabled. From now on the running application
will see
only compacted addresses, but if it attempted to read
from such an address before the relocation occurs, this is detected and
the GC relocates that small region of memory so that program execution
can continue (doing it early). This is the incremental
aspect of
such a garbage collector; the program runs slower during this phase
because each reference may trigger some relocation, but the slowdown is
bounded, and not proportional to memory size (only the fixed region
size). The program only waits for a fixed amount of work to be done for
the things it references, not all of memory. The reason for two virtual
addresses mapping to the same physical memory is so that the application
is prevented from referencing uncompacted memory, but the GC can access
it with the bit toggled.
The permission to reference a region can be implemented solely with page permissions, but it may somewhat expensive to change those, and in that case it is possible to implement the permission with a virtual address check. Page permissions are more scalable, because virtual address checks are likely to have limited bits for specification, but to illustrate pick a bit of the virtual address that allows mapping two virtual addresses to the same physical address (either due to duplicating the translation or because it is ignored by the processor on translation). This bit could be checked independently to permit or deny an access based on the contents of a CSR as an alternative to changing page permissions. For example, for 64 regions of 16384 pointers (131072 bytes), bits 22:17 could be used to select one bit from a CSR giving the permitted value of the selected bit for an access to go ahead. Changing the status of a region would simply involve toggling a bit in the CSR. Again, the permission test could be implemented as an instruction that returns a 0 or 1 value, a branch instruction, or a LG instruction that take an exception. There might be 1-4 CSRs (representing 64-256 regions) of permissions per generation. The issue with the CSR approach is the limitation on generation size; it is not scalable.
The primary issue with any GC acceleration is that it will be dependent on a fast exception mechanism, preferably one that is handled in user-mode, and that becomes expensive in terms of CSRs.
<webmaster at securerisc.org> | |||
2023-12-29 |