Proposal for RISC‑V GC

Version 0.1-draft-20240722

Table of Contents
Introduction
- Garbage Collection Algorithms
  - MoonGC
  - ZGC
Access Restriction
Generational GC
Opcode Assignment
Design Choices
Future Work
RISC-V Glossary

Introduction

This proposal assumes some familiarity with Garbage Collection (GC) and RISC‑V both. For those not as familiar with RISC‑V, a RISC‑V Glossary is provided at the end for some particular concepts.

RISC‑V GC is an unprivileged proposal for supporting Garbage Collection (GC) on RISC‑V. GC is required for many important programming languages, such as Java, Javascript, Python, and Julia. A goal of this proposal, at least in its full instantiation, is to make languages using GC nearly as efficient as those with explicit deallocation. To that end, it needs at least to support memory access with single load and store instructions, rather than a sequence of instructions that increases code size and reduces performance.

There are two aspects of GC that need to be addressed. The first is making it incremental and concurrent to prevent the long pauses associated with Stop The World (STW) GC [wikilink] . Henry Baker’s solution to this problem was published in 1978, and most solutions are variants and refinements of this old work. The second is supporting generations, which makes GC more efficient by concentrating effort where it matters most (recently allocated data).

This proposal begins by introducing two GC algorithms to be targeted (though other algorithms are expected to be supported by this proposal as well) and then goes on to introduce the extension that makes these algorithms more efficient by providing additional (beyond PMP and paging) access control for loads and stores before translation (if any) of the effective address. It then explains how generational GC might be incorporated, using an extension of the same mechanism.

Garbage Collection Algorithms

To understand the mechanisms proposed, it is helpful to first understand how they would be used by various GC algorithms. This proposal was originally developed with a GC algorithm proposed by David Moon, which herein is dubbed MoonGC. MoonGC originally used virtual memory to implement Baker GC’s fromspace and tospace concepts, but has been modified here to take advantage of the proposed virtual address restriction mechanism for very low overhead, and thus improved performance (thereby eliminating the need for separate fromspace and tospace). However, the proposed virtual address restriction is also suitable for use by other GC algorithms. In particular, one important algorithm, ZGC, could use virtual address restriction to implement its evacuation algorithm, and some discussion of ZGC follows the exposition of how MoonGC works with this proposal. ZGC also uses virtual memory to afford multiple views of physical memory. One goal of this proposal is to avoid the use of virtual memory for GC, which should result in significant performance improvement. At a high-level the primary difference between the two algorithms is that MoonGC uses compaction within a region whereas ZGC use copying to new regions to effect evacuation of highly fragmented regions. This proposal does not argue which of this approaches is better, but simply aims to support both. At a lower level the differences are the mechanisms ZGC uses to be efficient on architectures that do not provide GC support, including:

ZGC uses pointer coloring and virtual memory and to implement barriers on architectures that do not support barriers directly.
ZGC uses load barriers rather than use barriers to effect pointer coloring cleanup.

The following sections describe these two algorithms. Shenandoah GC is third GC algorithm. It uses Brooks’ forwarding pointers to work well on architectures without GC support. Since this proposal is designed to provide that support without forwarding pointers, Shenandoah is not further considered.

Still other Garbage Collection algorithms use write barriers rather than read barriers. Support for these is included in this proposal for completeness, but could be dropped if determined to be unnecessary.

MoonGC

David Moon, architect of Symbolics Lisp Machines, kindly offered suggestions on Garbage Collection (GC). Herein his algorithm is dubbed MoonGC. Moon began by observing the following:

Compacting garbage collection is better than copying garbage collection because it uses physical memory more efficiently.

Compacting garbage collection is better than non-moving garbage collection and C-style heap allocation because it does not cause fragmentation.

First, divide objects into small and large. Large objects are too large to be worth the overhead of compacting, larger than a few times the page size. Large objects are allocated from a heap and never change their address. The garbage collector frees a large object when it is no longer in use. By putting each large object on its own pages, physical memory is not fragmented and the heap only has to track free space at page granularity. Virtual address space gets fragmented, but it is plentiful so that is not a problem.

Small objects are allocated consecutively in fixed-size regions by advancing a per-region fill pointer until it hits the end of the region or there is not enough free space left in that region; at that point allocation switches to a different region. The region size is a small multiple of the page size. The allocator chooses the region from among a set of regions that belong to a user-specified area. Garbage collection will compact all in-use objects in a region to the low-address end of that region, consolidating all the free space at the high-address end where it can easily be reused for new allocations.

This proposal now uses incremental region for what MoonGC called simply region above. Before continuing, this proposal introduces this and other terminology in the next section.

One other advantage of compaction, not mentioned above, is that it provides a natural mechanism for determining the long-lifetime data in ephemeral generations: it is the data compacted to the lowest addresses.

MoonGC Terminology

Allocation is done in areas, which are for user-visible segregation of different-purpose allocations to different portions of memory. Areas consist of 1-4 generations, each generation consisting of some data structures and many small independent incremental regions that are used to implement incremental GC. The purpose of the incremental regions is to bound the timing of certain GC operations making program delays not proportional to memory size but only to incremental region size. When the application program needs to access an incremental region that has not been processed, the application thread invokes the GC code to process it immediately, and then resumes the application. The incremental region size is programmable and would be chosen to be small enough that the delay in processing it is acceptable to application performance, but large enough that its overhead is not excessive. A group of incremental regions is called a macro region, and a generation might be one or more macro regions. Macro regions are further divided into those for small and large objects, which use different algorithms for their incremental regions.

The above is briefly summarized below:

Address restriction: A per-thread mechanism to prevent access to incremental regions when compaction and/or translation is required.
Area: Programmers allocate objects in areas, which provides grouping. An area might consist of a data structure and several regions of the heap, one per generation. Areas are called arenas in jemalloc, and that term could be used here instead.
Generation: Generations group data into four lifetimes, from ephemeral (generation 0) to long-lived (generation 3), with generations 1 and 2 having intermediate object lifetimes. Generations consist of small-object macro regions and large-object macro regions.
Incremental Region (Iregion): To minimize the time the application is paused by the GC algorithm, objects are stored in incremental regions that can be compacted quickly in response to an access by the application; once the incremental region is compacted and/or translated, the application continues.
Large Object Region: Large objects (greater than a small multiple of the page size) are allocated as many pages as required, and are never moved (but pointers in large objects are translated). When they are no longer referenced, GC frees the pages they occupied.
Macro Region (Mregion): Incremental regions are small to minimize processing time, and so cannot hold all of the application data. Two groups of incremental regions provide the capacity required by the application in a generation. Macro regions are identified by the address restriction mechanism, and have individual access control for the incremental regions they contain. There are separate macro regions for large and small objects.
Small Object Region: Small objects (less than a small multiple of the page size) benefit from compaction, and are allocated sequentially from free space.

MoonGC, as originally presented, is a four phase algorithm to implement the above using only virtual memory and changing page permissions. The following adapts MoonGC to take advantage of the address restriction feature described below, as using virtual memory protection changes is costly (e.g. requiring TLB flushes). The restriction allows GC to deny application threads access to incremental regions when they are in an inconsistent state. The following also makes other minor changes so that the exposition below is new. The credit goes to David Moon, but problems and bugs are likely the result of these changes and exposition.

The application threads run concurrently with the GC threads, except in phase 3 (the stack scan). Application threads may be slowed during phase 4 as will be explained. The four phases of MoonGC are as follows:

Mark phase. Mark data reachable from roots in incremental region bitmaps. This phase is concurrent with the application. New allocations also mark the bitmaps.
Preparation phase. Process the bitmaps to prepare for small object relocation and free large object pages. This phase is concurrent with the application.
Relocation phase. Translate all roots, including the stack and the pointers from longer-lived generations, converting pre-compaction addresses to post-compaction addresses. Restrict the application access to all small and large object regions using address restriction. This is either via a trap handled at the same privilege level, or an instruction producing a value to branch on. It is therefore fairly low-overhead (discussed below). The application is paused during this phase, but this phase takes a short time and is not proportional to memory size, only the size of the stack and other roots. After this phase, the roots, stack, and all accessible memory contain only compacted addresses, and the application works only with such addresses. Application access to memory with pre-compaction addresses is restricted (GC threads continue to have access).
Compaction phase. This phase is concurrent with the application and occurs primarily in the GC threads, but application threads may join the work when needed. The compaction phase goes through all small object incremental regions and moves marked objects from their pre-compaction to post-compaction address, and also translates the pointers contained in the objects as well. It also goes through all large objects and translates the pointers contained therein. Once compaction is completed for an incremental region, the permission for that incremental region is enabled for the application. If an application thread tries to access a disabled region, it receives a trap. If compaction has already started on this region, the trap handler will simply wait. If compaction has not started, the trap handler switches to compaction and translation of the incremental region (only translation for large object regions). When an application thread joins the compaction threads, it will temporarily enable access, and then restore access when it finishes its incremental region, and then return to application work, which now that the region is compacted, will no longer be restricted. The time spent in the application thread doing compaction and translation is proportional to incremental region size and not memory size, bounding the pause.

Occasionally an extra phase of the algorithm might compact two incremental regions into one. Still additional phases might migrate objects from a frequent generation to a less frequent one.

The efficiency of translating pre-compaction to post-compaction addresses is critical. The original MoonGC proposal recognized that this time is probably limited by data cache misses, and used the preparation phase to convert the bitmaps into a relocation tree that would require only three cache block accesses per translation with binary searching. The following modification is proposed to reduce this to just two cache blocks by making extensive use of population count [wikilink] (popcount) operations.

Within a small object incremental region, the post-compaction offset of an object is the number of mark bits set in the incremental region bitmap for objects up to but not including that object. For translation, summing the popcount on all the doublewords in the bitmap prior to the doubleword for the pre-compaction address would touch too many cache blocks, so in phase 2 (preparation) compute the popcounts of each bitmap cache block and store them for lookup in phases 3 and 4. Each translation is then one popcount cache block access and one bitmap cache block access. For a small object incremental region holding N objects and a cache block size of 512 bits (64 B), the number of bitmap cache blocks B is ⌈N/512⌉. Store 0 in summary[0]; store popcount(bitmap[0..511]) in summary[1]; store summary[1]+popcount(bitmap[512..1023]) in summary[2]; and so on … and finally store summary[B-2]+popcount(bitmap[N-1024..N-511]) in summary[B-1]. If N ≤ 65536 then the summary count array elements fit in 16 bits, and so the size of the summary array is ⌈B/32⌉ cache blocks, and if N ≤ 16384 the summary array fits in only one cache block. To translate from the pre-compaction offset to the post-compaction offset in phases 3 and 4, simply take the ⌊offset/512⌋ as the index into this array to get the number of objects before the bitmap cache block. Now read the bitmap cache block. Add the popcount of the 1-8 doublewords up to the object of interest (using a mask on the last doubleword read) to the lookup value. This sum is the post-compaction offset in the small object incremental region. If eight popcounts are too costly, then the summary popcount array may be doubled in size to cover just four doublewords, or a vector popcount reduction instruction might be added to make this even more efficient.

As an example, to illustrate the above, consider using the access restriction mechanism described below to select an incremental region size of 128 KiB (131072 B). An object pointer is converted to its containing incremental region by simply clearing the low 17 bits. There are 16104 doublewords (2013 cache blocks) of object store (98.29%), which are stored starting at offset 0 in the incremental region. The bitmap summary popcounts are 64 bytes starting at 128832. Bitmaps are 2016 bytes (31.5 cache blocks) starting at 128896. Finally there are 160 bytes (20 doublewords, 2.5 cache blocks) of incremental region overhead for locks, freelists, etc. available starting at 130912. To go from the pointer to its bitmap byte, add bits 16:6 to the region pointer plus 128896 and the bit is given by bits 5:3.

Incremental region layout examples
Mregion		Iregion	Objects		Summary		Bitmap		Other
S	size	Dws	Dws	%	Dws	%	Dws	%	Dws	%
0	8 KiB	8			0		1
1	16 KiB	16			0		1
2	32 KiB	32	24	75.00	0		1	3.12	8	25.00
3	64 KiB	64	56	87.50	0		1	1.56	8	12.50
4	128 KiB	128	112	87.50	0		2	1.56	14	10.94
5	256 KiB	256	240	93.75	0		4	1.56	12	4.69
6	512 KiB	512	480	93.75	0		8	1.56	23	4.49
7	1 MiB	1024	984	96.09	1	0.10	16	1.56	23	2.25
8	2 MiB	2048	1992	97.27	1	0.05	32	1.56	23	1.12
9	4 MiB	4096	4008	97.85	2	0.05	63	1.54	23	0.56
10	8 MiB	8192	8040	98.14	4	0.05	126	1.54	22	0.27
11	16 MiB	16384	16104	98.29	8	0.05	252	1.54	20	0.12
12	32 MiB	32768	32232	98.36	16	0.05	504	1.54	16	0.05
13	64 MiB	65536	64480	98.39	32	0.05	1008	1.54	16	0.02
14	128 MiB	131072	128976	98.40	63	0.05	2016	1.54	17	0.01
15	256 MiB	262144	257968	98.41	126	0.05	4031	1.54	19	0.01
16	512 MiB	524288	515952	98.41	252	0.05	8062	1.54	22	0.00
17	1 GiB	1048576	1031928	98.41	504	0.05	16124	1.54	20	0.00
⋮
27	1 TiB	2³⁰
⋮
37	1 PiB	2⁴⁰
⋮

Smaller incremental regions may provide better real-time response, but limit the size of a macro region due to the number of access bits provided by the access restriction mechanism specified below. Larger incremental regions pause the application for longer and also require a larger summary popcount array, but allow for larger memory spaces. Each generations would likely choose different, appropriate incremental region sizes. Typically generation 0 (the most ephemeral) would use small incremental regions, while generation 3 (the most stable) would use incremental regions sized to fit the amount of data required.

A possible improvement to the algorithm is to have areas use slab allocation for a few common sizes. For example, there might be separate incremental regions for 1, 2, …, 8, and >8‑doubleword objects. This allows a simple freelist to be used for objects ≤8 doublewords so that compaction is not required on every GC. Incremental regions for ≤8 words might only be compacted when it would allow pages to be reclaimed or cache locality to be increased. Note that different tradeoffs may be appropriate for triggering compaction in ephemeral vs. long-lived generations. Also, bitmaps could be potentially use only one bit per object rather than one bit per word in 2‑word, 4‑word, and 8‑word regions, making these even more efficient. However, that requires a more complicated mark and translation code sequence.

ZGC

ZGC is one of the Garbage Collection algorithms for OpenJDK. The reference used here is Deep Dive into ZGC: A Modern Garbage Collector in OpenJDK by Yang and Wrigstad. It is fairly similar to MoonGC, except that it uses virtual memory to implement address restriction. The following exposition quotes Yang and Wrigstad interspersed with comments.

A major challenge concurrent collectors have to address is how to ensure that mutators and GC threads share a coherent view of the heap. Without special treatment, mutators may mutate the object graph in a way that invalidates a decision the GC just made. A typical solution to this problem is to have mutators run some extra code when objects are read (read barrier) or written (write barrier) so that the GC is informed and can act accordingly. The ZGC algorithm uses a special kind of read barrier, called a load barrier (explained in more detail in Section 3.1), to ensure mutators and GC share a coherent view. During marking, ZGC implements the incremental-update algorithm [Pirinen 1998; Wilson 1992], maintaining the strong tricolor invariant [Dijkstra et al. 1978], i.e., there are no pointers from black objects to white objects. After marking, sparsely populated regions are selected for evacuation. During the evacuation, load barriers are again used to maintain the invariant that mutators never see pointers into regions of memory marked for evacuation. Any such pointer will be updated in the load barrier to point to an object’s new location, possibly including performing the relocation, in competition with GC threads. This concept is very similar to Baker-style relocation [Baker 1978]. Because relocation information is stored outside the original regions, old regions can be reclaimed after evacuation, which concludes a ZGC cycle.

Unlike MoonGC, ZGC is not based on compaction of regions, but the evacuation of fragmented regions. Yang and Wrigstad mention in their section 3.8 that ZGC maintains off-heap forwarding tables (mapping pre-relocation (old) to post-relocation (new) addresses) for objects in regions being evacuated, but does not describe the data structures for this forwarding. These data structures are probably not as compact as the bitmaps used by MoonGC, but are only required on regions being evacuated. On the other hand, MoonGC occasionally needs to support inter-region compaction that ZGC does not.

To ensure that mutators see only valid pointers, even with GC running concurrently, ZGC utilizes colored pointers and load barriers. Pointers are always 64-bit structures, consisting of meta bits (color of the pointer) and address bits. The number of address bits determines the theoretical maximal heap size supported. Heap space is broken into memory regions of one of three size classes, small, medium, and large. These regions are called pages inside OpenJDK, and we adopt this terminology. To avoid confusion with OS pages, we will explicitly prefix page by OS when we mean an OS page from this point on. Depending on its size, an object is allocated on a page of one particular size class using bump-pointer allocation. Pages of small and medium size classes can accommodate multiple objects, but pages of large size classes hold only a single object. In other words, an object allocated on a large page gets its own page.

The three size classes represent a small refinement over the two classes in MoonGC. As in small object regions in MoonGC, the regions of the small and medium classes hold multiple objects, whereas large class regions are used only for single objects. The use of the pages terminology from OpenJDK is potentially confusing, and when not quoting, this overview of ZGC substitutes region for page.

The ZGC cycle consists of three STW pauses and four concurrent phases: marking/remapping (M/R), reference processing (RP), selection of evacuation candidates (EC), and relocation (RE).

In later versions of ZGC, the first and fourth phases were merged so that relocation performed marking for the next GC cycle. As in MoonGC, the application is only allowed to operate on relocated data. This is accomplished by using four address bits in pointers, which in ZGC terminology is termed the pointer’s color.

At any instant of time, there is a globally agreed-upon single good color, and its selection is decided twice during a ZGC cycle. … Once the good color is decided, all other colors are considered bad. Object creation always yields a pointer with the current good color. … The same physical heap memory is mapped into the virtual address space three times, resulting in three views of the same physical memory; each view corresponds to one good color … so that pointers with good color can be dereferenced directly, and the corresponding virtual-to-physical memory mapping will be used. Only the view corresponding to the current good color is active and accessed. … A read barrier is code executed when reading a pointer from the heap, e.g., in var x = obj.field, where x is a local variable living on the stack, and field is a pointer living on the heap. If the barrier is invoked on field, it is called a load barrier, ensuring only clean pointers live on the stack. In contrast, if the barrier is invoked on obj, it is called a use barrier, ensuring pointers are cleaned before using (dereferencing). ZGC uses load barriers to ensure the pointer in field has the good color.

The four address bits used for coloring in ZGC are basically used to detect when an address needs forwarding because it points to a region being evacuated. With processor support for address testing, this can be replaced by testing the address against CSRs that indicate whether forwarding is required or not. A given CSR matches part of the heap and provides a single bit of forwarding required or not for 128 subdivisions of that match. Once pointer coloring is no longer necessary, the distinction between load barriers and use barriers becomes less critical, but this proposal still suggests support for load barriers.

In the exposition below, the terminology introduced above for MoonGC applies to ZGC as follows. Macro regions are used for each generation and size class. Incremental regions correspond to what OpenJDK calls pages (not OS pages), and represent areas of the address space that may be marked for evacuation and thus be detected by a load barrier.

Access Restriction

Stop The World (STW) GC does not require access restriction (barriers), but the need to continue application work concurrently with GC introduces the need to detect situations where GC has started work, but that that some accesses require additional work before the application can continue. This proposal provides a general mechanism for fine-grain access control based only on addresses. This is a very general mechanism that may have many other uses than for GC. For MoonGC, it can be used to detect references to data that has yet to be compacted, and force that compaction, or for ZGC to detect references to regions that are being evacuated, so that the forwarding tables can be used to provide the new address. Both MoonGC and ZGC originally used multiple virtual to physical address mapping to effect access restriction, which requires significant overhead in the virtual memory subsystem and TLB misses during execution. Both would instead be modified to use the access restriction mechanism described below in this section. This also eliminates the dependency upon translation, which low-end processors may not have, and it is plausible GC might be useful on such processors. In particular, this proposal could be used in a mode without address translation (e.g. Machine Mode or when satp.Mode = Bare) or on a processor without translation at all, or it could be used on addresses prior to translation (virtual addresses). Because these addresses are for loads and stores, to simplify the exposition, this proposal uses the common terminology for a data address effective address for all of these cases.

Replacing page permission modification in MoonGC with address restriction is motivated by the need for GC threads to have access to pre-compaction incremental regions while the applications do not have access. Rather than dual mapping these incremental regions and providing different page permissions, this proposal employs a separate per-thread access mechanism.

This proposal currently only addresses RV64. The extension for RV32 and RV128 can be done if others are interested in pursuing this proposal as a starting point for RISC‑V GC.

This proposal shares a similar mechanism to the Embedded Sandbox proposal, and if pursued, should be unified with that. In particular, the TYP field in amatch registers allows alternate uses of this similar mechanism.

This extension would be enabled by a bit in the TBD Control Status Register (CSR).

This proposal starts with the assumption that software will designate one or more macro regions of the address space to be subject to additional access control. For example, when Garbage Collection is used for reclaiming allocated storage, only the heap might be subject to additional protection to implement read or write barriers. These macro regions of the address space are specified using a PMP-like NAPOT matching mechanism to provide flexible sizing. Implementations may support 1-16 macro regions, but matching for eight macro regions is suggested, which would support four generations of small object macro regions, and four generations of large object regions. Macro regions are matched against an effective address using a fully associative CAM. A match results in 128 access restriction bits, with one bit selected by the address bits below the match. In particular, for N macro regions, and NM1=N-1, there are Address Match CSRs amatch0 to amatchNM1. When a processor implements N amatchi registers, then it also implements exactly N corresponding adata0il/adata0ih pairs and adata1il/adata1ih pairs. The l registers hold bits 63:0 and the h registers hold bits 127:64. The adata registers have multiple uses depending on the value of the TYP field in amatch register. The format of the amatchi registers is as follows, using a NAPOT encoding of the bits to compare when testing for a match.

amatchi registers are WARL.

Address Match Registers
63			12	11	6	5	4	3	0
0	addr[VAW−1:13+S]	2^S		RSW		GEN		TYP
64−VAW	VAW−13−S	1+S		6		2		4

Fields of amatchi registers
Field	Width	Bits	Description
TYP	4	3:0	0 ⇒ Disabled 1 ⇒ Access restriction for GC 2 ⇒ Write restriction for GC 3 ⇒ Access and write restriction for GC 4 ⇒ Smallest GC generation contained (automatic update) 5 ⇒ Exception if smaller GC generation stored 6..15 ⇒ Reserved
GEN	2	5:4	For Address Restriction for GC (TYP in 1..5) this field specifies the GC Generation of this macro region.
RSW	6	11:6	Reserved for software
2^S	1+S	12+S:12	NAPOT encoding of address region to match
addr[VAW-1:13+S]	VAW−13−S	VAW−1:13+S	Address to match

When bits VAW−1:13+S of an effective address match the same bits of amatchi, and the TYP field is 1 or 3, then the corresponding adata0il/adata0ih pair specifies 128 additional access denial bits for the incremental regions thereof. Similarly the TYP field is 2 or 3, then the corresponding adata1il/adata1ih pair specifies 128 additional write denial bits for the incremental regions thereof. When the TYP field is 1, the adata1il/adata1ih pair is ignored (equivalent to being all zeros) and need not be context switched. When the TYP field is 2, the adata0il/adata0ih pair is ignored (equivalent to being all zeros) and need not be context switched. When the TYP field is 3, the combination of the corresponding bits in the adata0il/adata0ih and adata1il/adata1ih pairs both being set is Reserved for future use.

In particular, on a match to amatchi, the bits selected from the adata0il/adata0ih and adata1il/adata1ih pairs is given by bits 18+S:12+S of the effective address. For the pair of bits from adata1/adata0, 00 is no restriction, 01 is access (both loads and stores) restriction, 10 is store restriction only, and 11 is Reserved for future use. The value of S comes from the NAPOT encoding of amatchi registers, as the number zero bits starting from bit 12 (i.e., S=0 if bit 12 is 1, S=1 if bits 13:12 are 10, and so on). Setting bits 63:12 to 2⁵¹ causes it to match the entire address space. The lowest numbered amatchi match has priority. If no amatchi register matches then there is no additional access check.

Disabled amatchi registers (TYP = 0) may be used by the supervisor to reduce context switch overhead; disabled registers may be treated as having amatchi/adata0i/adata1i all zero.

When a GC thread finishes compaction of an incremental region, application access is not immediately enabled since that would require sending an interrupt to all the application threads telling them to update their adata0 registers. Instead the updated adata0 bits are stored in memory, and the next application exception will load the updated value before testing whether compaction is required, in progress, or still needs to be done.

adata1i registers are not required for MoonGC and ZGC, described above, and may be left set to zero for that algorithm. They could be omitted if another use is not found for them, but they may be useful for other GC algorithms that employ write barriers. They are proposed to hold a GC generation number below, which if accepted, would mean they could not be omitted.

ZGC employs load barriers rather than use barriers. This can be supported by adding a Load pointer for GC (LGC) instruction that applies the restrictions to the value being loaded rather than the address being loaded from. Below a Store pointer for GC (SGC) instruction is proposed for generational GC support.

The above says loads and stores are restricted by adata0/adata1, which could mean two things. First it could cause an exception to a handler at the current privilege level. This is the preferred mechanism as it provides the smallest code size and highest performance. RISC‑V does not currently have user-mode exception handlers and adding it would require uepc, utvec, ucause, utval registers and a URET instruction (a uscratch register is not required—tp could be used instead— but might be added as well). If it is undesirable to provide a user-mode fault mechanism, then the LTEST and STESTinstructions described below could be used and the sign bit of the result branched on.

LTEST and STEST instructions will be specified to indicate which amatchi register matches the address specified in rs1. In there is not a match, these instructions return 0. In addition to being useful in the trap handler, this may be used by the GC algorithm to determine which GC allocation area a pointer addresses. These instructions would return the matching incremental region address ea[62:12+S] in the same bits of rd. The low bits (11+S:0) are copied from the RSW and GEN fields of the matching amatch register. LTEST indicates load restriction and STEST store restriction. Both would indicate restriction by setting bit 63 of rd (this limits the address space to 2⁶³ bytes, which is not currently an issue for RISC‑V virtual address translation modes, but it would complicate use of this feature in supervisor mode). The case STEST also treats its rs2 operand, if not x0, as a pointer value being stored, to support Generational GC, as described below.

Generational GC

Generational GC works by collecting/compacting smaller short-lived areas frequently and larger long-lived areas less frequently, with the eventual migration of long-lived data in the frequent GC areas to less frequent ones. Call the most frequently collected area with the most ephemeral allocations generation 0, and the least frequently collected and most stable area generation 3, with generations 1 and 2 being in between. To mark lower numbered areas, pointers from higher numbered areas must serve as roots. A store of a pointer to a low numbered area to a location in a higher numbered area must be detected, and that location in the high number area marked as root for the low numbered area. The contents of these locations must serve to start the GC mark phase and updated during the relocation phase. Such stores are not frequent, so what is required is efficient detection of such a store, with slower handling in the infrequent case. The key observation is that both the data being stored and location being written need to have their generation numbers compared.

To implement Generational GC the amatch registers implement a 2‑bit GC generation number GEN and, in the case of user-mode exception support, a Store Pointer for GC (SGC) instruction is added. For SGC, the processor will already find a matching amatch for the effective address register, and so know the generation being stored to from bits 5:4 of the matching register. What is also required is performing the same lookup on value being stored, which is additional logic. If the generation being stored is less than generation of the location being stored to, then the store is restricted. If user-mode exceptions are not implemented, the STEST instruction would be used and the result tested before the store. In this case the rs1 operand would be x0 for a non-pointer store, and contain the pointer value to be stored otherwise, so that the above test can be included in the sign-bit result.

When garbage collecting generation 0, address restriction is only required for generation 0 incremental regions; the regions of generations 1..3 do not require restriction. When garbage collecting generation 1, address restriction is only required for generations 0 and 1 incremental regions; the regions of generations 2 and 3 do not require restriction, and so on. This allows the amatch to be used for another purpose based on a new value in the TYP field. In particular, the pair of bits of adata1i/adata0i selected by 18+S:12+S of the effective address are instead used as the smallest generation number to which that incremental region contains pointers. If the stored pointer has a generation smaller than this value, then the value is lowered to the pointer’s generation number for TYP = 4 and an exception occurs for TYP = 5.

Opcode Assignment

The preferred instantiation of this proposal calls for Load pointer for GC (LGC) and Store pointer for GC (SGC) instructions to be added. SGC could be encoded under the STORE major opcode with funct3=7. However, there is no encoding space for LGC under the LOAD major opcode, so this instruction would be located under major opcode MISC-MEM instead. A possible encoding is to use funct3=4 for LGC. Because these instructions are unlikely to require offsets with bits 1:0 nonzero, it is suggested that these bits be Reserved for future use, e.g. for other loads and stores.

Design Choices

Allowing address restriction to extend down to 64‑byte granularity is an arbitrary choice for GC. Larger granularity might reduce the cost by an almost meaningless amount. Smaller granularity is unlikely to be useful unless another use for address restriction is found that requires it and implemented with reserved values of the TYP field.
Providing 128 subregion permission bits allows for a graceful extension to RV128 and allows for larger data areas for the different regions. However, for RV32 it would require an expansion of pairs to 4 registers, or a reduction to just 64 subregions.

Future Work

Long-lived generations might be large, and 128 sub-regions thereof may result in a large number of data being considered roots for more ephemeral generations using TYP in 4..5. At some point a different mechanism might be required. For example, a macro region of 512 KiB would have generation dirty levels stored for 4 KiB (512 doublewords) incremental regions. Is 512 too many pointers to scan as roots? Beyond 512 KiB for macro regions, PTEs would have finer grain than TYP values of 4 or 5, but the cost of accessing PTE fields is high, given the cost of doing an ECALL. One possible finer grain solution would be a new TYP code that stores dirty levels at finer grain in a portion of the macro region (e.g. two bits for every 8 words or something similar). This however would require processor to implement a new cache of these values to make it efficient.
While the TYP values of 4 or 5 may be useful for storing generation numbers when collecting generation 0, there is a conflict when collecting more than just generation 0. Is the only answer to have yet more amatch registers? That is getting expensive.
Should decide whether amatchi etc. are regular CSRs or Indirect CSRs.
Obviously user-mode must not be able to restrict the addresses used by more privileged modes. There are multiple ways to address this, and one will need to be specified in the future.
Should it be required to support macro regions with more than 128 incremental regions, then it is possible to use the setting of adata0/adata1 bit pair where each is set to encode an escape to a memory table as follows:
00 ⇒ access allowed to all incremental regions in this intermediate region;
01 ⇒ write access denied to all incremental regions in this intermediate region;
10 ⇒ access denied to all incremental regions in this intermediate region;
11 ⇒ memory bitmap gives access for the incremental regions of this intermediate region.
A memory bitmap could be located in the upper 1/64th of the macro region virtual address space.

RISC-V Glossary

Some acronyms used herein may be unfamiliar as they come from the RISC‑V [wikilink] architecture.
The official RISC‑V ISA specifications may be downloaded from RISC‑V specifications while working versions may be found at the GitHub RISC‑V ISA Manual repository.
The two primary specifications are:

The RISC-V Instruction Set Manual Volume I: Unprivileged ISA
The RISC-V Instruction Set Manual Volume II: Privileged Architecture

CSR: An acronym for Control Status Register, a register used to control certain aspects of processor behavior.
Doubleword: For RISC‑V a word is defined as 32 bits, so a doubleword is 64 bits.
Indirect CSR: These are CSRs accessed by the RISC‑V Smcsrind/Sscsrind extensions.
NAPOT: An acronym for Naturally Aligned Power-of-2 (Chapter 5 “Svnapot” Standard Extension for NAPOT Translation Contiguity)
NAPOT refers to things of size 2^N bytes being aligned to that size (i.e. the address is a multiple of 2^N, viz. the least significant N bits of the address are zero). In RISC‑V NAPOT sizes are represented with repeated 1s (e.g. in PMP entries) or repeated 0s (e.g. in PTEs) in the lower address bits, then the opposite bit, and then from that point actual address bits, which allows the address and size to encoded in 1 bit more than the address alone.
PMP: An acronym for Physical Memory Protection (section 3.7.1 Physical Memory Protection CSRs), which is a mechanism for providing Read, Write, and Execute permission for ranges of physical memory addresses independent of the translation mechanism, and which is logically anded with those permissions.
WARL: An acronym for Write Any Values, Reads Legal Values, which is a CSR characterization that allows processors to implement a subset of the functionality described in a CSR (e.g. by hardwiring certain bits to fixed values).

		<webmaster at securerisc.org>
2024-07-22