jasone/Hemlock 0

Programming language

jasone/jemalloc-ci 0

Continuous integration for jemalloc

jasone/skip 0

A programming language to skip the things you have already computed

PR opened jemalloc/jemalloc

HPA: More directly assign hugepage ownership to arenas

Each of these commits is individually small, but put together they are a fairly big design shift in the HPA. We now explicitly track a whole hugepage's state in an hpa_shard-local context. This lets us make smarter decisions about when to hugeify or not, and also allows purging hugepages in a reasonable way.

For now, we only purge whole hugepages, and only when they become completely empty, but this structure will allow finer-grained / smarter strategies as we continue to evolve.
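Purely as an illustration of the design described above (the names and fields here are hypothetical, not jemalloc's actual hpa_shard/hpdata types), the shard-local per-hugepage bookkeeping might look roughly like this:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch only -- not jemalloc's real internal structures. */
typedef struct hpa_hugepage_state_s {
	void	*addr;		/* Base address of the hugepage. */
	size_t	nactive_pages;	/* Small pages currently allocated out of it. */
	bool	hugeified;	/* Whether the OS has been asked to back it hugely. */
} hpa_hugepage_state_t;

/* Purge policy described in the PR: whole hugepages only, and only when empty. */
static bool
hpa_should_purge(const hpa_hugepage_state_t *hp) {
	return hp->nactive_pages == 0;
}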

+1188 -1046

0 comment

25 changed files

pr created time in 2 days

PR opened jemalloc/jemalloc

Allow fractional settings for narenas_ratio.

There's no fundamental reason this ratio ought to be an integer. With the HPA allowing (and benefitting from) much lower arena counts, this has become a day-to-day pain. Setting the number of arenas to (say) half the number of CPUs is hard to do without knowing the environment you'll run on.
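For illustration, assuming the option keeps the narenas_ratio name from this PR's title and accepts a decimal value (the exact key and syntax are assumptions based on the PR, not documented behavior), an application could request half an arena per CPU through jemalloc's standard malloc_conf mechanism:

/* Assumed syntax for the fractional ratio added by this PR; check the
 * jemalloc docs for the final form.  malloc_conf itself is the documented
 * way for an application to embed option strings. */
const char *malloc_conf = "narenas_ratio:0.5";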

+595 -6

0 comment

9 changed files

pr created time in 2 days

issue opened jemalloc/jemalloc-experiments

CMake Issue

CMake Error at CMakeLists.txt:12 (target_link_libraries): Cannot specify link libraries for target "jemalloc" which is not built by this project.

target_link_libraries(jemalloc INTERFACE dl)

created time in 3 days

pull request comment jemalloc/jemalloc

utrace support with label based signature.

Thanks!

devnexen

comment created time in 3 days

push event jemalloc/jemalloc

David Carlier

commit sha 520b75fa2daf3313d87780f40ca0101c83c10398

utrace support with label based signature.

view details

push time in 3 days

pull request comment jemalloc/jemalloc

utrace support with label based signature.

Yes they pass for both.

devnexen

comment created time in 3 days

pull request comment jemalloc/jemalloc

utrace support with label based signature.

Is this for OpenBSD? (Just from Googling, that's the only OS I can find that takes a utrace label.) Could you confirm that the tests, when configured with both:

./configure --enable-debug --enable-utrace

and

./configure --enable-utrace

pass? We don't have CI set up for those cases, and I don't have an OpenBSD system readily available to test on.

devnexen

comment created time in 3 days

PR closed jemalloc/jemalloc

Fix gcc mem barriers for arm.

membar is a SPARC instruction; using DMB here instead.

+3 -1

2 comments

1 changed file

devnexen

pr closed time in 3 days

pull request comment jemalloc/jemalloc

Fix gcc mem barriers for arm.

Yes, you're right.

devnexen

comment created time in 3 days

pull request comment jemalloc/jemalloc

Fix gcc mem barriers for arm.

I think this code is right as is -- note that we're checking __arch64__ (single a) rather than __aarch64__ (two a's); i.e. 64-bit SPARC (I don't totally remember why this was the decision; perhaps that's the only static clue we get that the barrier is definitely present?).

The ARM case doesn't match any of the specific preprocessor defines, so it'll go down the default path and call __sync_synchronize(), which should be right.
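For reference, a condensed sketch of the dispatch being described (paraphrased, not the exact jemalloc source -- the real code lives in jemalloc's atomic/memory-barrier headers and differs in detail):

/* Paraphrased illustration of the selection logic discussed above. */
static inline void
memory_barrier(void) {
#if defined(__sparc) && defined(__arch64__)
	/* 64-bit SPARC: membar is available as an instruction. */
	__asm__ __volatile__("membar #StoreLoad" ::: "memory");
#else
	/* Default path (ARM included): let the compiler emit the right
	 * barrier, e.g. DMB on ARM. */
	__sync_synchronize();
#endif
}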

devnexen

comment created time in 3 days

issue comment jemalloc/jemalloc

Question about an obvious decrease in memory consumption

Why does using jemalloc reduce memory consumption by so much? Is the reduced part mostly memory fragmentation?

I think that this may be one source, but my recollection is that glibc malloc's heuristics about when to return memory to the OS are less developed as well.

Is there any tool or method for verification?

I suppose the gold standard for memory consumption is OS-level RSS counters (in, say, top).
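To make that concrete, here is a minimal sketch (assuming a non-prefixed jemalloc build with stats enabled, which is the default) of reading jemalloc's own counters via the documented mallctl() interface, so they can be compared against the RSS reported by top or /proc/<pid>/status. A large gap between stats.allocated and stats.resident is a rough proxy for fragmentation plus memory not yet returned to the OS.

#include <stdint.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

static void
print_jemalloc_usage(void) {
	/* Refresh jemalloc's cached statistics. */
	uint64_t epoch = 1;
	size_t sz = sizeof(epoch);
	mallctl("epoch", &epoch, &sz, &epoch, sz);

	size_t allocated, resident;
	sz = sizeof(size_t);
	mallctl("stats.allocated", &allocated, &sz, NULL, 0);
	mallctl("stats.resident", &resident, &sz, NULL, 0);
	printf("allocated=%zu resident=%zu\n", allocated, resident);
}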

firejq

comment created time in 3 days

issue opened jemalloc/jemalloc

Question about an obvious decrease in memory consumption

I used jemalloc to replace glibc's ptmalloc in a multi-threaded machine-learning training workload, and the effect was significant: the memory consumed by the program was reduced by 50%. While pleasantly surprised, I was also a little confused. I read some related articles, such as the following links; they also show an obvious memory reduction, but the specific reason is not made clear.

Therefore, I raised this issue.

  1. Why does using jemalloc reduce memory consumption by so much? Is the reduced part mostly memory fragmentation?
  2. Is there any tool or method for verification?

I am looking forward to any suggestions. Thanks!

Ref:

created time in 4 days

PR opened jemalloc/jemalloc

Fix gcc mem barriers for arm.

membar is a SPARC instruction; using DMB here instead.

+3 -1

0 comment

1 changed file

pr created time in 4 days

PR opened jemalloc/jemalloc

Rename cumbits to cumulative_bits

Hi, I'm deeply sorry for this PR, but the internet is full of people who get offended easily by jokes, etc.

So, I would like to suggest renaming cumbits to cumulative_bits, to prevent people from being offended by a variable name that might be read as a sexually related joke.

I will understand if you mark this PR as invalid or close it without merging.

+2 -2

0 comment

1 changed file

pr created time in 6 days

PR opened jemalloc/jemalloc

utrace support with label based signature.
+30 -6

0 comment

4 changed files

pr created time in 10 days

issue comment jemalloc/jemalloc

jemalloc crashes on random places

Problem solved, sorry for the false alarm. It was a mistake in Puppet, so the application was launched with LD_PRELOAD on top of the newer malloc version compiled in :(

xphoenix

comment created time in 11 days

issue closed jemalloc/jemalloc

jemalloc crashes on random places

Version: 5.2.1
Behaviour: crashes with random application stacks
Usage: statically linked; jemalloc is ensured to be first in the library list
Compiler: gcc-7.3.1, C++17
What have I checked:

  • checked the ld --verbose output, and yes, jemalloc is the first in the list
  • built the application without jemalloc but with the address and memory sanitizers and with -fsized-deallocation; the application works fine

When built with jemalloc, it randomly crashes, causing a deadlock in the application due to the signal handler:

#0  atomic_load_p (mo=atomic_memory_order_relaxed, a=<optimized out>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#1  rtree_leaf_elm_bits_read (dependent=true, elm=<optimized out>, rtree=<optimized out>, tsdn=<optimized out>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:162
#2  rtree_szind_slab_read_fast (r_slab=<synthetic pointer>, r_szind=<synthetic pointer>, key=0, rtree_ctx=<optimized out>, rtree=<optimized out>, tsdn=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:474
#3  emap_alloc_ctx_try_lookup_fast (alloc_ctx=<synthetic pointer>, ptr=0x0, emap=<optimized out>, tsd=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/emap.h:194
#4  free_fastpath (size_hint=false, size=0, ptr=ptr@entry=0x0) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2703
#5  free (ptr=ptr@entry=0x0) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2761
#6  0x00007f342d4ba733 in _IO_vfprintf_internal (s=<optimized out>, format=<optimized out>, ap=<optimized out>) at vfprintf.c:2058
#7  0x00007f342d4bfe5b in buffered_vfprintf (s=s@entry=0x7f342d83a1c0 <_IO_2_1_stderr_>, format=format@entry=0x1024a8c " in thread %u:\n", args=args@entry=0x7f3339bf1328) at vfprintf.c:2319
#8  0x00007f342d4ba81e in _IO_vfprintf_internal (s=0x7f342d83a1c0 <_IO_2_1_stderr_>, format=0x1024a8c " in thread %u:\n", ap=ap@entry=0x7f3339bf1328) at vfprintf.c:1289
#9  0x00007f342d4c5407 in __fprintf (stream=stream@entry=0x7f342d83a1c0 <_IO_2_1_stderr_>, format=format@entry=0x1024a8c " in thread %u:\n") at fprintf.c:32
#10 0x00000000007a0aab in print_header (this=0x7f3339bf15e0, thread_id=2271, os=0x7f342d83a1c0 <_IO_2_1_stderr_>) at /usr/src/debug/crow-cluster/third-party/backward-cpp/backward.hpp:1777
#11 backward::Printer::print<backward::StackTrace> (this=this@entry=0x7f3339bf15e0, st=..., os=0x7f342d83a1c0 <_IO_2_1_stderr_>) at /usr/src/debug/crow-cluster/third-party/backward-cpp/backward.hpp:1750
#12 0x00000000007a0ec4 in backward::SignalHandling::sig_handler (info=0x7f3339bf17b0, _ctx=<optimized out>) at /usr/src/debug/crow-cluster/third-party/backward-cpp/backward.hpp:1967
#13 <signal handler called>
#14 atomic_store_p (mo=atomic_memory_order_release, val=0xe8000000000000, a=0x5600) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#15 rtree_leaf_elm_write (slab=false, szind=232, edata=0x0, elm=0x5600, rtree=<optimized out>, tsdn=0x7f3339bf4390) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:288
#16 emap_rtree_write_acquired (slab=false, szind=232, edata=0x0, elm_b=0x7f2cf8d54140, elm_a=0x5600, emap=<optimized out>, tsdn=0x7f3339bf4390)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/emap.c:142
#17 je_emap_deregister_boundary (tsdn=tsdn@entry=0x7f3339bf4390, emap=<optimized out>, edata=edata@entry=0x7f33c8dcf970) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/emap.c:186
#18 0x00000000009dcdea in extent_deregister_impl (tsdn=tsdn@entry=0x7f3339bf4390, edata=0x7f33c8dcf970, gdump=gdump@entry=true) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:357
#19 0x00000000009df72f in extent_deregister (edata=<optimized out>, tsdn=0x7f3339bf4390) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:371
#20 extent_recycle_split (growing_retained=<optimized out>, edata=<optimized out>, szind=40, slab=false, alignment=64, pad=4096, size=32768, new_addr=0x0, ecache=0x7f33388031a8, ehooks=0x7f33388000c0, 
    arena=0x7f33388008c0, tsdn=0x7f3339bf4390) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:570
#21 extent_recycle (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, ehooks=ehooks@entry=0x7f33388000c0, ecache=ecache@entry=0x7f33388031a8, new_addr=new_addr@entry=0x0, size=size@entry=32768, 
    pad=<optimized out>, alignment=<optimized out>, slab=<optimized out>, szind=<optimized out>, zero=<optimized out>, commit=<optimized out>, growing_retained=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:605
#22 0x00000000009df861 in je_ecache_alloc (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, ehooks=ehooks@entry=0x7f33388000c0, ecache=ecache@entry=0x7f33388031a8, new_addr=new_addr@entry=0x0, 
    size=size@entry=32768, pad=4096, alignment=64, slab=false, szind=40, zero=0x7f3339bf1f8f) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:108
#23 0x00000000009a53b5 in je_arena_extent_alloc_large (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, usize=usize@entry=32768, alignment=alignment@entry=64, zero=zero@entry=0x7f3339bf1f8f)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/arena.c:434
#24 0x00000000009e1be2 in je_large_palloc (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, usize=32768, alignment=alignment@entry=64, zero=zero@entry=false)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/large.c:48
#25 0x00000000009e207e in je_large_malloc (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, usize=<optimized out>, zero=zero@entry=false)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/large.c:18
#26 0x000000000099436c in tcache_alloc_large (slow_path=false, zero=false, binind=<optimized out>, size=32768, tcache=0x7f3339bf45d8, arena=<optimized out>, tsd=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/tcache_inlines.h:108
#27 arena_malloc (slow_path=false, tcache=0x7f3339bf45d8, zero=false, ind=<optimized out>, size=32768, arena=0x0, tsdn=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:173
#28 iallocztm (slow_path=false, arena=0x0, is_internal=false, tcache=0x7f3339bf45d8, zero=false, ind=<optimized out>, size=32768, tsdn=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
#29 imalloc_no_sample (ind=<optimized out>, usize=32768, size=32768, tsd=0x7f3339bf4390, dopts=<synthetic pointer>, sopts=<synthetic pointer>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2005
#30 imalloc_body (tsd=0x7f3339bf4390, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2196
#31 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2303
#32 je_malloc_default (size=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2334
#33 0x00000000009953a4 in malloc (size=size@entry=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2442
#34 0x0000000000a1b859 in newImpl<false> (size=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc_cpp.cpp:92
#35 operator new (size=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc_cpp.cpp:139
#36 0x00000000008e0f24 in allocate (this=<optimized out>, __n=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/ext/new_allocator.h:111
#37 allocate (__a=..., __n=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/alloc_traits.h:436
---Type <return> to continue, or q <return> to quit---
#38 _M_allocate (this=<optimized out>, __n=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_vector.h:172

Looks like heap corruption, to me at least, but the sanitizers are unable to find anything. The server built without jemalloc runs for weeks under load without leaking or any noticeable memory problem :(

closed time in 11 days

xphoenix

issue opened jemalloc/jemalloc

jemalloc crashes on random places

Version: 5.2.1
Behaviour: crashes with random application stacks
Usage: statically linked; jemalloc is ensured to be first in the library list
What have I checked:

  • checked the ld --verbose output, and yes, jemalloc is the first in the list
  • built the application without jemalloc but with the address and memory sanitizers and with -fsized-deallocation; the application works fine

When built with jemalloc, it randomly crashes, causing a deadlock in the application due to the signal handler:

#0  atomic_load_p (mo=atomic_memory_order_relaxed, a=<optimized out>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#1  rtree_leaf_elm_bits_read (dependent=true, elm=<optimized out>, rtree=<optimized out>, tsdn=<optimized out>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:162
#2  rtree_szind_slab_read_fast (r_slab=<synthetic pointer>, r_szind=<synthetic pointer>, key=0, rtree_ctx=<optimized out>, rtree=<optimized out>, tsdn=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:474
#3  emap_alloc_ctx_try_lookup_fast (alloc_ctx=<synthetic pointer>, ptr=0x0, emap=<optimized out>, tsd=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/emap.h:194
#4  free_fastpath (size_hint=false, size=0, ptr=ptr@entry=0x0) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2703
#5  free (ptr=ptr@entry=0x0) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2761
#6  0x00007f342d4ba733 in _IO_vfprintf_internal (s=<optimized out>, format=<optimized out>, ap=<optimized out>) at vfprintf.c:2058
#7  0x00007f342d4bfe5b in buffered_vfprintf (s=s@entry=0x7f342d83a1c0 <_IO_2_1_stderr_>, format=format@entry=0x1024a8c " in thread %u:\n", args=args@entry=0x7f3339bf1328) at vfprintf.c:2319
#8  0x00007f342d4ba81e in _IO_vfprintf_internal (s=0x7f342d83a1c0 <_IO_2_1_stderr_>, format=0x1024a8c " in thread %u:\n", ap=ap@entry=0x7f3339bf1328) at vfprintf.c:1289
#9  0x00007f342d4c5407 in __fprintf (stream=stream@entry=0x7f342d83a1c0 <_IO_2_1_stderr_>, format=format@entry=0x1024a8c " in thread %u:\n") at fprintf.c:32
#10 0x00000000007a0aab in print_header (this=0x7f3339bf15e0, thread_id=2271, os=0x7f342d83a1c0 <_IO_2_1_stderr_>) at /usr/src/debug/crow-cluster/third-party/backward-cpp/backward.hpp:1777
#11 backward::Printer::print<backward::StackTrace> (this=this@entry=0x7f3339bf15e0, st=..., os=0x7f342d83a1c0 <_IO_2_1_stderr_>) at /usr/src/debug/crow-cluster/third-party/backward-cpp/backward.hpp:1750
#12 0x00000000007a0ec4 in backward::SignalHandling::sig_handler (info=0x7f3339bf17b0, _ctx=<optimized out>) at /usr/src/debug/crow-cluster/third-party/backward-cpp/backward.hpp:1967
#13 <signal handler called>
#14 atomic_store_p (mo=atomic_memory_order_release, val=0xe8000000000000, a=0x5600) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#15 rtree_leaf_elm_write (slab=false, szind=232, edata=0x0, elm=0x5600, rtree=<optimized out>, tsdn=0x7f3339bf4390) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:288
#16 emap_rtree_write_acquired (slab=false, szind=232, edata=0x0, elm_b=0x7f2cf8d54140, elm_a=0x5600, emap=<optimized out>, tsdn=0x7f3339bf4390)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/emap.c:142
#17 je_emap_deregister_boundary (tsdn=tsdn@entry=0x7f3339bf4390, emap=<optimized out>, edata=edata@entry=0x7f33c8dcf970) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/emap.c:186
#18 0x00000000009dcdea in extent_deregister_impl (tsdn=tsdn@entry=0x7f3339bf4390, edata=0x7f33c8dcf970, gdump=gdump@entry=true) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:357
#19 0x00000000009df72f in extent_deregister (edata=<optimized out>, tsdn=0x7f3339bf4390) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:371
#20 extent_recycle_split (growing_retained=<optimized out>, edata=<optimized out>, szind=40, slab=false, alignment=64, pad=4096, size=32768, new_addr=0x0, ecache=0x7f33388031a8, ehooks=0x7f33388000c0, 
    arena=0x7f33388008c0, tsdn=0x7f3339bf4390) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:570
#21 extent_recycle (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, ehooks=ehooks@entry=0x7f33388000c0, ecache=ecache@entry=0x7f33388031a8, new_addr=new_addr@entry=0x0, size=size@entry=32768, 
    pad=<optimized out>, alignment=<optimized out>, slab=<optimized out>, szind=<optimized out>, zero=<optimized out>, commit=<optimized out>, growing_retained=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:605
#22 0x00000000009df861 in je_ecache_alloc (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, ehooks=ehooks@entry=0x7f33388000c0, ecache=ecache@entry=0x7f33388031a8, new_addr=new_addr@entry=0x0, 
    size=size@entry=32768, pad=4096, alignment=64, slab=false, szind=40, zero=0x7f3339bf1f8f) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/extent.c:108
#23 0x00000000009a53b5 in je_arena_extent_alloc_large (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, usize=usize@entry=32768, alignment=alignment@entry=64, zero=zero@entry=0x7f3339bf1f8f)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/arena.c:434
#24 0x00000000009e1be2 in je_large_palloc (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, usize=32768, alignment=alignment@entry=64, zero=zero@entry=false)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/large.c:48
#25 0x00000000009e207e in je_large_malloc (tsdn=tsdn@entry=0x7f3339bf4390, arena=arena@entry=0x7f33388008c0, usize=<optimized out>, zero=zero@entry=false)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/large.c:18
#26 0x000000000099436c in tcache_alloc_large (slow_path=false, zero=false, binind=<optimized out>, size=32768, tcache=0x7f3339bf45d8, arena=<optimized out>, tsd=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/tcache_inlines.h:108
#27 arena_malloc (slow_path=false, tcache=0x7f3339bf45d8, zero=false, ind=<optimized out>, size=32768, arena=0x0, tsdn=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:173
#28 iallocztm (slow_path=false, arena=0x0, is_internal=false, tcache=0x7f3339bf45d8, zero=false, ind=<optimized out>, size=32768, tsdn=<optimized out>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
#29 imalloc_no_sample (ind=<optimized out>, usize=32768, size=32768, tsd=0x7f3339bf4390, dopts=<synthetic pointer>, sopts=<synthetic pointer>)
    at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2005
#30 imalloc_body (tsd=0x7f3339bf4390, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2196
#31 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2303
#32 je_malloc_default (size=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2334
#33 0x00000000009953a4 in malloc (size=size@entry=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc.c:2442
#34 0x0000000000a1b859 in newImpl<false> (size=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc_cpp.cpp:92
#35 operator new (size=32768) at /usr/src/debug/crow-cluster/third-party/jemalloc-5.2.1/src/jemalloc_cpp.cpp:139
#36 0x00000000008e0f24 in allocate (this=<optimized out>, __n=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/ext/new_allocator.h:111
#37 allocate (__a=..., __n=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/alloc_traits.h:436
---Type <return> to continue, or q <return> to quit---
#38 _M_allocate (this=<optimized out>, __n=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/stl_vector.h:172

Looks like heap corruption, to me at least, but the sanitizers are unable to find anything. The server built without jemalloc runs for weeks under load without leaking or any noticeable memory problem :(

created time in 12 days

issue comment jemalloc/jemalloc

jemalloc 5.1.0 fails in unit tests with gcc8 and profiling enabled on armv7

It now succeeds with gcc-10 branch head (a78cd759754c92cecbf235ac9b) and glibc 2.32. No idea what fixed it, though.

ggardet

comment created time in 14 days

PR opened jemalloc/jemalloc

Backport jeprof --collapse for flamegraph generation

This patch backports the --collapse flag that was added to pprof (source) for generating flamegraphs.

See https://github.com/gperftools/gperftools/commit/6ce10a2a05f13803061538d5c77e89695de59be4 for the original commit by @matt-cross.

+52 -0

0 comment

1 changed file

pr created time in 14 days

push event jemalloc/jemalloc

Yinan Zhang

commit sha 566c4a8594d433ac40ebfd5a4736a53c431f81dd

Slight changes to cache bin internal functions

view details

Yinan Zhang

commit sha 4a65f34930fb5e72b2d6ab55d23b5971a5efefbd

Fix a cache bin test

view details

Yinan Zhang

commit sha be5e49f4fa09247a91557690cdaef42a82a83d6a

Add a batch mode for cache_bin_alloc()

view details

Yinan Zhang

commit sha ac480136d76010243f50997a1c1231a5572548aa

Split out locality checking in batch allocation tests

view details

Yinan Zhang

commit sha d96e4525adaefbde79f349d024eb5f94e72faf50

Route batch allocation of small batch size to tcache

view details

Yinan Zhang

commit sha 92e189be8b725be1f4de5f476f410173db29bc7d

Add some comments to the batch allocation logic flow

view details

push time in 17 days

PR merged jemalloc/jemalloc

Route batch allocation of small batch size to tcache

For batch allocation requests of small sizes, instead of going to the page allocator for a fresh new slab, follow the ordinary hierarchy i.e. tcache -> arena -> page allocator, to improve performance. The downside is of course memory locality; however, locality was originally a best effort anyway.

A few remarks on the current version:

  • I don't yet have a great way to determine the size threshold for deciding which route ("tcache route" / "page allocator route") the request should go through. Right now I just put an arbitrary number as a placeholder. I think there are a few aspects we need to consider here:
    • Improvement on performance. If we plot the difference in performance between the two routes against the batch size, then I imagine there are two big jumps: one is around the point at which the request cannot be fulfilled by the tcache without going to the arena, the other is around the point at which the request cannot be fulfilled by existing slabs in the arena without going to the page allocator. Both these two points can be a wider range instead of a single point (so the jump may be a steady one rather than a sudden one). Meanwhile, not sure about the magnitude of the two jumps: I suspect that the first jump should be quite significant, but how about the second one?
    • Cost on locality: locality is correlated with performance of later memory access by the caller, in particular sequential memory access, so locality also comes with a performance price tag, though a harder-to-measure one. Therefore, in theory, we can also plot the difference in memory access performance between the two routes against the batch size, and try to find a good spot to set the threshold. The order of memory locality, from worst to best, is tcache -> existing slabs in arena -> fresh new slabs; not sure how that'd translate into how the plot would look.
    • It seems to me that many things I touched above have some dependencies on the caller. So, if we want to be lazy / flexible, we also have the option of asking the caller to choose, and to of course also bear the tuning effort. In such a case, we may have a default (of e.g. always going through one route). For the caller flag, we can use MALLOCX_TCACHE_AUTO / MALLOCX_TCACHE_NONE for the distinction.
  • If we decide to have a size threshold ourselves rather than letting the caller specify, then we may want to allow the caller to specify the mallocx tcache flag. If the batch size is large, we can ignore the specified flag. However, if the batch size is small, then we need to decide what to do when the caller specifies MALLOCX_TCACHE_NONE. Obviously, we can ignore it. Meanwhile, it might also make some sense to treat that as "I don't care about performance", so we can go for the "page allocator route", to optimize for locality. If so, then the caller essentially has a "one way" overwrite privilege to our "route decision".
  • Previously, I applied special care to the batch allocation API so that, whenever a manual arena is specified, we only serve memory strictly from that arena. Now, if tcache is allowed, then we can no longer guarantee that in the batch allocation API, just as is the case for all other APIs we have. Therefore, I decided to simply save the effort and not try, even when I could, i.e. when the tcache is bypassed.
  • When going through the tcache route, I make the tcache bin act as a middleman, and I never explicitly touch the arena level. This approach may not be optimal if the batch size is large enough to require more than one refill from the arena, when compared to the alternative approach where, whenever the tcache is depleted, I directly go to the arena and collect the remaining batch. My current approach is just easier to implement, since it roughly follows the original logic structure I had. Meanwhile, it might not be too bad, because we only decide to route through the tcache when the batch size is small, and it has the advantage of not leaving the tcache empty for the next allocation request. However, I'm also open to the alternative approach, and my guess is that I'd need to do some refactorings first for it.
  • For unit tests, I currently short-circuit the locality and stats tests whenever the tcache route is chosen: the locality tests are irrelevant; there might be some opportunity for some stats tests, but it might be complicated to write. Please let me know your thoughts.
  • I previously did not optimize for large sizes, but now it's sometimes possible, when (a) the size is not too large to be cached in tcache and (b) the batch size is small. Will do that after the core logic settles.
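A compressed, hypothetical sketch of the routing decision described in this PR (the threshold constant and the overall structure are placeholders for illustration, not the actual batch_alloc() implementation):

#include <stddef.h>
#include <jemalloc/jemalloc.h>

/* Illustrative only.  Picking the threshold is the open question discussed
 * in the remarks above; 8 is just a placeholder. */
#define BATCH_TCACHE_THRESHOLD 8

static size_t
batch_alloc_sketch(void **ptrs, size_t nalloc, size_t size, int flags) {
	/* Small batches: keep the tcache in the path so the requests stay
	 * cheap.  Large batches: bypass the tcache (MALLOCX_TCACHE_NONE is
	 * the documented mallocx flag) so memory comes from the arena/page
	 * allocator, favoring locality over per-call speed. */
	if (nalloc > BATCH_TCACHE_THRESHOLD) {
		flags |= MALLOCX_TCACHE_NONE;
	}
	size_t progress = 0;
	while (progress < nalloc) {
		void *p = mallocx(size, flags);
		if (p == NULL) {
			break;
		}
		ptrs[progress++] = p;
	}
	return progress;
}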
+227 -111

6 comments

5 changed files

yinan1048576

pr closed time in 17 days

pull request comment jemalloc/jemalloc

Add support for QNX.

Sure, no problem at all.

Hi David, a new pull request was uploaded here under my company email account. https://github.com/jemalloc/jemalloc/pull/1972

It received the same failure though: "User is too new to use Community clusters! Please check with support!" Not sure if there's anything I can do; please let me know. I'm happy to move it forward.

Thanks! Jin

jqian82

comment created time in 17 days

Pull request review comment jemalloc/jemalloc

Route batch allocation of small batch size to tcache

 batch_alloc(void **ptrs, size_t num, size_t size, int flags) {
 					bin = &tcache->bins[ind];
 				}
 			}
+			/*
+			 * If we don't have a tcache bin, we don't want to
+			 * immediately give up, because there's the possibility

As far as I understand, that should be the main one. Some other possibilities include:

  • tcache has not been initialized, which should be rare.
  • tcache is disabled by the caller explicitly.
yinan1048576

comment created time in 17 days

Pull request review comment jemalloc/jemalloc

Route batch allocation of small batch size to tcache

 batch_alloc(void **ptrs, size_t num, size_t size, int flags) {
 					bin = &tcache->bins[ind];
 				}
 			}
+			/*
+			 * If we don't have a tcache bin, we don't want to
+			 * immediately give up, because there's the possibility
+			 * that the user explicitly requested to bypass the
+			 * tcache, and in such cases, we go through the slow
+			 * path, i.e. the mallocx() call at the end of the
+			 * while loop.  A slight overhead here would be that
+			 * the "if (bin == NULL)" branch above will be entered
+			 * in every iteration of the while loop, but it
+			 * shouldn't be a huge issue, because (a) the
+			 * computations there are cheap and (b) users
+			 * requesting to bypass the tcache care less about
+			 * performance.
+			 */
 			if (bin != NULL) {
 				size_t bin_batch = batch - progress;
+				/*
+				 * n can be less than bin_batch, meaning that
+				 * the cache bin does not have enough memory.
+				 * In such cases, we rely on the slow path,
+				 * i.e. the mallocx() call at the end of the
+				 * while loop, to fill in the cache, and in the
+				 * next iteration of the while loop, the tcache
+				 * will contain a lot of memory, and we can
+				 * harvest them here.  Compared to the
+				 * alternative approach where we directly go to
+				 * the arena bins here, the overhead of our
+				 * current approach should usually be minimal,
+				 * since the tcache is usually larger than a

You're right.

yinan1048576

comment created time in 17 days

Pull request review comment jemalloc/jemalloc

Route batch allocation of small batch size to tcache

 batch_alloc(void **ptrs, size_t num, size_t size, int flags) {
 					bin = &tcache->bins[ind];
 				}
 			}
+			/*
+			 * If we don't have a tcache bin, we don't want to
+			 * immediately give up, because there's the possibility
+			 * that the user explicitly requested to bypass the
+			 * tcache, and in such cases, we go through the slow
+			 * path, i.e. the mallocx() call at the end of the
+			 * while loop.  A slight overhead here would be that
+			 * the "if (bin == NULL)" branch above will be entered
+			 * in every iteration of the while loop, but it
+			 * shouldn't be a huge issue, because (a) the
+			 * computations there are cheap and (b) users
+			 * requesting to bypass the tcache care less about
+			 * performance.
+			 */
 			if (bin != NULL) {
 				size_t bin_batch = batch - progress;
+				/*
+				 * n can be less than bin_batch, meaning that
+				 * the cache bin does not have enough memory.
+				 * In such cases, we rely on the slow path,
+				 * i.e. the mallocx() call at the end of the
+				 * while loop, to fill in the cache, and in the
+				 * next iteration of the while loop, the tcache
+				 * will contain a lot of memory, and we can
+				 * harvest them here.  Compared to the
+				 * alternative approach where we directly go to
+				 * the arena bins here, the overhead of our
+				 * current approach should usually be minimal,
+				 * since the tcache is usually larger than a

Is this true? I think lg_fill_div is often positive, so we fill less than a slab amount into tcaches.
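For a rough sense of scale (assuming the fill count is ncached_max >> lg_fill_div, as in current jemalloc -- an assumption, so check the tcache fill code): with ncached_max = 200 and lg_fill_div = 2, a single refill pulls only 50 objects from the arena, which for small size classes is typically far fewer than the number of regions in a slab.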

yinan1048576

comment created time in 17 days

Pull request review comment jemalloc/jemalloc

Route batch allocation of small batch size to tcache

 batch_alloc(void **ptrs, size_t num, size_t size, int flags) {
 					bin = &tcache->bins[ind];
 				}
 			}
+			/*
+			 * If we don't have a tcache bin, we don't want to
+			 * immediately give up, because there's the possibility

This (i.e. that the user specified MALLOCX_TCACHE_NONE) is the only possibility, right?

yinan1048576

comment created time in 17 days

push event skiplang/skip

Julien Verlaguet

commit sha dc27634ad89333c450a7d183832a2b0c75189337

Disabling the logic zeroing gaps between fields in 32bits mode. (#201)

When initializing an object, the logic "zeros" the gaps between the fields. I don't know why but I suspect it has to do with the way interning works. The hash-consing of an object probably scans the memory of the object at some point. The hash could give different results if it included gaps with different values.

However, the interning in 32bits mode doesn't require this to happen. So disabling it for now. It would probably be cleaner to fix the math so that it also works in 32bits mode, but it's not urgent.

view details

push time in 19 days

PR merged skiplang/skip

Disabling the logic zeroing gaps between fields in 32bits mode.

CLA Signed

When initializing an object, the logic "zeros" the gaps between the fields. I don't know why but I suspect it has to do with the way interning works. The hash-consing of an object probably scans the memory of the object at some point. The hash could give different results if it included gaps with different values.

However, the interning in 32bits mode doesn't require this to happen. So disabling it for now. It would probably be cleaner to fix the math so that it also works in 32bits mode, but it's not urgent.
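To illustrate why un-zeroed gaps can matter for hash-consing (a generic C example, not the skip runtime's actual interning code): two logically equal objects can hash differently if the hash walks raw memory that includes padding bytes.

#include <stdio.h>
#include <string.h>

/* 'tag' is followed by compiler-inserted padding before 'value'.  If that
 * padding isn't zeroed, a byte-wise hash of two logically equal objects
 * can differ. */
struct obj {
	char	tag;
	long	value;
};

static unsigned long
hash_bytes(const void *p, size_t len) {
	const unsigned char *b = p;
	unsigned long h = 5381;
	for (size_t i = 0; i < len; i++) {
		h = h * 33 + b[i];	/* djb2-style hash over raw bytes */
	}
	return h;
}

int
main(void) {
	struct obj a, b;
	memset(&a, 0xAA, sizeof(a));	/* Simulate dirty padding. */
	memset(&b, 0x55, sizeof(b));
	a.tag = b.tag = 'x';
	a.value = b.value = 42;
	/* Logically equal, but the raw-byte hashes may differ because the
	 * padding bytes differ -- unless the runtime zeroes the gaps. */
	printf("%lu vs %lu\n", hash_bytes(&a, sizeof(a)), hash_bytes(&b, sizeof(b)));
	return 0;
}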

+6 -1

0 comment

1 changed file

pikatchu

pr closed time in 19 days
