
eholk/harlan 1165

A language for GPU computing.

IUCompilerCourse/Essentials-of-Compilation 516

A book about compiling Racket and Python to x86-64 assembly

Co-dfns/Co-dfns 465

High-performance, Reliable, and Parallel APL

haskell/criterion 460

A powerful but simple library for measuring the performance of Haskell code.

pycket/pycket 219

A rudimentary Racket implementation using RPython

MPLLang/mpl 81

The MaPLe compiler for Parallel ML

lkuper/lvar-examples 34

Programming with LVars, by example

dettrace/dettrace 22

A determinizing tracer using Ptrace

lkuper/dissertation 21

Exactly what it says on the tin.

eamsden/pushbasedFRP 18

TimeFlies: Push-Pull Signal-Function Functional Reactive Programming (Master's Thesis)

started facebookincubator/Glean

started time in 18 days

started fepitre/debrebuild

started time in 18 days

started slsa-framework/slsa

started time in a month

started Helium4Haskell/helium

started time in a month

started wz1000/packed-traverse

started time in a month

issue comment iu-parfunc/gibbon

A GC algorithm for semi-packed data

@rrnewton layout computation starting from roots sounds like tag-free garbage collection. Intensional polymorphism and "type-passing polymorphism" could be related as well.

Thanks! I hadn't read this. IMO, some of these older papers are so refreshing to read because they assume less context and explain the problems clearly from scratch, and their citations in turn are fewer degrees removed from bedrock foundations ;-).

We do indeed like tag-free GC for Gibbon. We also get to solve a much easier problem than Tolmach'94 as long as we lean on our closed-world assumption (whole-program compilation) and compile only monomorphic, first-order programs in the backend, and we also avoid mutable data on the heap. (Except for arrays, atm, which are managed with linear types.)

The cool thing about tag-freedom in the above generational proposal is that we really only need type info as deep as the nursery. Old generation data can be reclaimed without knowing what type it is (reference count hits zero, drop the entire region).
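To make the "drop the entire region" step concrete, here is a minimal C sketch of type-agnostic reclamation. The names `region_t`, `chunk_t`, `region_add_chunk`, and `region_release` are hypothetical and are not taken from Gibbon's runtime; the sketch also leaves out outsets and footers, and only shows that freeing a region needs no knowledge of what is stored in it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical chunk header; the payload bytes follow it in memory. */
typedef struct chunk {
    struct chunk *next;
} chunk_t;

/* Hypothetical region: a reference count plus a linked list of chunks. */
typedef struct {
    unsigned refcount;
    chunk_t *chunks;
} region_t;

/* Add a chunk with `payload` bytes to the region. Returns 0 on success. */
int region_add_chunk(region_t *r, size_t payload) {
    chunk_t *c = malloc(sizeof(chunk_t) + payload);
    if (c == NULL) return -1;
    c->next = r->chunks;
    r->chunks = c;
    return 0;
}

/* Drop one reference. When the count reaches zero, free every chunk
   without inspecting its contents. Returns the number of chunks freed
   (0 if the region is still live). */
unsigned region_release(region_t *r) {
    assert(r->refcount > 0);
    if (--r->refcount > 0) return 0;
    unsigned freed = 0;
    chunk_t *c = r->chunks;
    while (c != NULL) {
        chunk_t *next = c->next;
        free(c);
        freed++;
        c = next;
    }
    r->chunks = NULL;
    return freed;
}
```

Nothing in `region_release` depends on the layout of the packed data, which is the property the comment above relies on for the old generation.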

rrnewton

comment created time in a month

issue comment iu-parfunc/gibbon

A GC algorithm for semi-packed data

@ckoparkar - it was a nice result that you got on the overhead of repeated consing (as in reverse, #126) being lower than expected for Gibbon vs a malloc-list.

Moving forward, it seems less exciting to make incremental improvements on small regions using Gibbon's current strategy (#100). That will always amount to making a suboptimal solution less bad. What seems more exciting is spending energy on one of the following two possibilities:

(1) dupable linear lists (internally mutable list-of-vectors), (2) a generational collector that is copying/bump-alloc in the nursery, and the current region-refcounting strategy in the old generation.

Re (1), this gets more at what we think is the optimal runtime representation for environments in compiler passes.

Re (2): The beginnings of a GC proposal in this old issue here didn't get to the point of describing generational collection. But generational collection, with heterogeneous representations, could be the thing that manages impedance mismatching between the types of programs that dump huge trees into regions, and the programs that allocate many small ("Bounded", #79) regions.

Generational Gibbon GC

Just to spell out the proposal a bit, the idea is that we simply use a traditional nursery, and every let-region turns into a fixed-size bump-alloc in the nursery. If we can statically bound a region's size (especially to a single heap object like a Cons cell), then we alloc that size. If not, we alloc whatever the initial chunk size is, K, a relatively small value. (Ideally, we base these tunings on profiling data, and I'm hoping that the compiler can recognize O(1) constant-sized things, and that the other regions are tuned to something reasonable for O(log(N)) regions that house a small, but dynamic, number of heap objects over their lifetime.)
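As a rough illustration of the allocation rule just described, here is a minimal C sketch of a bump-allocated nursery. The names `nursery_t`, `nursery_init`, `nursery_alloc`, and `alloc_region`, and the value chosen for `INITIAL_CHUNK_SIZE` (the K above), are all assumptions made for this example and do not come from Gibbon's runtime. Alignment is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nursery: one contiguous buffer with a bump pointer. */
typedef struct {
    uint8_t *start; /* first byte of the nursery */
    uint8_t *alloc; /* next free byte */
    uint8_t *end;   /* one past the last byte */
} nursery_t;

/* "K" in the text; the concrete value here is an arbitrary assumption. */
enum { INITIAL_CHUNK_SIZE = 64 };

void nursery_init(nursery_t *n, uint8_t *buf, size_t size) {
    assert(buf != NULL);
    n->start = buf;
    n->alloc = buf;
    n->end = buf + size;
}

/* Bump allocation. Returns NULL when the nursery cannot fit the request,
   which is the point at which a collector would evacuate it. */
void *nursery_alloc(nursery_t *n, size_t size) {
    if ((size_t)(n->end - n->alloc) < size) return NULL;
    void *p = n->alloc;
    n->alloc += size;
    return p;
}

/* Allocate the first chunk of a let-region. A static_bound of 0 means the
   compiler found no static bound, so we fall back to INITIAL_CHUNK_SIZE. */
void *alloc_region(nursery_t *n, size_t static_bound) {
    size_t size = static_bound != 0 ? static_bound : INITIAL_CHUNK_SIZE;
    return nursery_alloc(n, size);
}
```

A region bounded to a single 16-byte cell would be requested as `alloc_region(&n, 16)`, while an unbounded region would be requested as `alloc_region(&n, 0)` and receive K bytes.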

We enumerate roots using an existing strategy (like @vollmerm mentioned). When the nursery is full, we evacuate it to Gibbon's regular, reference-counted heap of regions (malloc-managed). Thus if you allocate, say, a cons list out of order, you initially have heavy pointer chasing when you read the list. But once those elements graduate to the old generation, you get the normal copying GC locality benefit on steroids, because the data becomes more densely packed.

One interesting wrinkle is what to do when growing a region in the nursery. There are several choices. You could keep the new (larger) chunks of the region in the nursery. But if we look carefully at data on region size distribution, I bet the Bayesian answer will be that once a region starts growing it is likely to go large. It probably makes sense to allocate everything but the first chunk of a region directly into a reference-counted generation. But this would mean that, unlike the normal region growth process (where region identity, outset, and reference count are shared by all chunks), the nursery chunk 0 and the refcounted chunk 1 would effectively be separate regions. I.e. the redirection pointer at the end of chunk 0 would be no different than pointers from any nursery region into any older region.

A potential downside of this eager promotion would be that as soon as we grow a nursery region we start incurring the overhead of tracking outsets and reference counts when writing indirections. Whereas nursery=>old pointers are very cheap as usual; just like in deferred reference counting, they don't count against any reference counts tracked in the heap.

Hopefully, the end result would be that our speed of allocating (out of order) and traversing cons-lists would be no better and not-much-worse than existing typed functional language implementations --- which is quite good! Furthermore, in-order list operations like map should be much faster out of the box! (They would stream directly to a packed representation on the output, irrespective of how fragmented their inputs are.) From the perspective of programs like map, or the tree-traversals we've focused on in our previous papers (ECOOP'17, PLDI'19, ICFP'21), there should be minimal overhead for a brief pitstop in the nursery. Just one chunk gets copied!

Complications

The nasty bit is dealing with the heterogeneity of indirection pointers. If they point into the nursery, they point directly at a (packed) heap object with no region-footer objects for metadata. Most of these objects will not get any benefit from packing, and in fact will have a little bit of inefficiency due to extra I tags compared to a traditional ML/Haskell implementation.

Pointers within the old generation will work as they do now, and pointers from the nursery to the old generation will need to make it possible to reach the footer. Right now, Cursorize manages to have available end-pointers for the from/to regions whenever an indirection is written. This makes reaching the footer objects trivial, because we use the footer pointer as the end pointer. The GC, when it is copying a pointer-to-old from the nursery to the old generation, will effectively be executing this indirection-creation code without the benefit of that compile-time information.

The GC will need a runtime way to find footers that correspond to an arbitrary address in the old generation. Of course, distinguishing nursery and old generation pointers based on addresses will be trivial, as these can be non-overlapping address ranges. We will only be allocating nice, clean, power-of-2 chunk sizes, and we can further expect that our memory management system should be able to tell us the size of the allocated chunk containing an arbitrary pointer. If we always put the footer at the end of that malloc'd block, then we can

There's already the not-really-portable malloc_usable_size, but I don't actually know if the glibc ptmalloc implementation of it lets you give it arbitrary pointers in the middle of allocations, or just the pointers that were returned from malloc(). (Easy to test.) There also may be some extra motivation to customize the allocator to guarantee that when we malloc(2^n) we always get back a precise allocation, with no extra overhead at the end. Otherwise we'd need to call malloc_usable_size every time on the pointers we get back from malloc, just to make sure we know the true end of the block.
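For comparison, here is a C sketch of one way to reach a footer from an interior pointer without asking the allocator anything. It rests on a simplifying assumption that is not part of the proposal above: every old-generation chunk has the same power-of-2 size and is allocated with alignment equal to that size (via C11 `aligned_alloc`), so masking the low bits of any interior address yields the chunk base. `CHUNK_SIZE`, `footer_t`, `chunk_alloc`, `footer_of`, and `in_nursery` are hypothetical names for this example, and the footer fields are placeholders.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed chunk size; must be a power of 2. */
#define CHUNK_SIZE ((size_t)4096)

/* Placeholder footer stored in the last bytes of each chunk. */
typedef struct {
    uint32_t refcount;
    void *next_chunk;
} footer_t;

/* Allocate a chunk whose base address is a multiple of CHUNK_SIZE. */
void *chunk_alloc(void) {
    return aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);
}

/* Map any address inside a chunk to that chunk's footer by masking the
   low bits to recover the base, then stepping to the end of the chunk. */
footer_t *footer_of(const void *p) {
    uintptr_t base = (uintptr_t)p & ~((uintptr_t)CHUNK_SIZE - 1);
    return (footer_t *)(base + CHUNK_SIZE - sizeof(footer_t));
}

/* Nursery membership is an address-range check on [start, end). */
bool in_nursery(const void *p, const void *start, const void *end) {
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)start && a < (uintptr_t)end;
}
```

With chunks of varying power-of-2 sizes the mask alone is not enough, since the size of the containing chunk must be known first, which is where a size query such as malloc_usable_size or a customized allocator would come in.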

Footnotes

Using tracing/copying in the nursery and reference counting after that makes this similar to Ulterior Reference Counting (2003), though with the added complications of packing and of region-level rather than object-level reference counting.

rrnewton

comment created time in a month