If you are wondering where the data of this site comes from, please visit https://api.github.com/users/shreevatsa/events. GitMemory does not store any data, but only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.

shreevatsa/sanskrit 46

Tool(s) to help read Sanskrit (and other) metrical verse

shreevatsa/knuth-literate-programs 26

Examples of literate programming by Knuth

shreevatsa/intranslit 4

Transliteration for Indian (and possibly other) scripts

shreevatsa/pandoc-mathjax-filter 4

A Pandoc filter for typesetting (rendering) TeX snippets server-side, using mathjax-node

shreevatsa/tex 3

TeX-related stuff

shreevatsa/nmisra-arapv 2

LaTeX source files for the Sanskrit book “Adhyātmarāmāyaṇe’pāṇinīyaprayogāṇāṃ Vimarśaḥ” from https://sites.google.com/site/nmisra/ARAPV.zip

shreevatsa/webWEB 2

WEB/CWEB on the web

shreevatsa/site 1

Yet another attempt to write

shreevatsa/zipf-sanskrit 1

(Done) Some code I wrote to plot frequency of words for Sanskrit texts, to look into whether Zipf's law holds

issue comment tikv/pprof-rs

Segmentation fault / memory leak

There are some known situations where pprof-rs may cause a segmentation fault:

If your program is capturing a backtrace (e.g. other profilers are running, or the Error generates a backtrace automatically...)

Ah, it was the latter. This was more or less my first Rust program, and initially I had .unwrap() everywhere. At some point I gathered the impression (incorrect, it looks like?) that it was "better" to use Result types and change every .unwrap() to ?, but that meant I no longer got stack traces on error, so I added them back as here. So I guess that's the problem: trying to do both profiling and have Error generate backtraces. (In the next program I guess I'll just use unwrap() except where I plan to handle the error… not yet sure how to get stack traces in that case though; losing stack traces seems like a reason to avoid Error/Result types!)

Running the program under leak sanitizer showed a leak from objects allocated from here I think (see output in comment elsewhere), though maybe I've misunderstood.

I don't know how LeakSanitizer handles lazy_static objects. It seems that all memory allocated from Profiler::new is regarded as a leak. However, all of this memory is referenced from the global lazy_static variable 🤦‍♂️.

Thanks, I was suspecting that there wasn't an actual leak. Looks like LeakSanitizer blamed pprof-rs for incorrect reasons, but removing it made the segmentation fault go away.

shreevatsa

comment created time in 3 days

issue opened tikv/pprof-rs

Segmentation fault / memory leak

Hi, thank you for this really nice crate. I couldn't get most other profilers to work on a macOS laptop (x86_64-apple-darwin), while this one works easily and generates useful flamegraphs.

Unfortunately it seems to be causing a segmentation fault in my program when I use it. There's no backtrace and the location of the crash seems to be non-deterministic so I have only indirect evidence:

  • There was a segmentation fault when I was using pprof::ProfilerGuard and it went away when I removed it.
  • Running the program under leak sanitizer showed a leak from objects allocated from here I think (see output in comment elsewhere), though maybe I've misunderstood.

Maybe the issue is not with pprof directly but with the combination of it and some other dependency, but I doubt I have the ability to debug further. Just filing this here in case the issue is obvious or it helps somehow.

created time in 4 days

issue comment shreevatsa/pdf-glyph-mapping

Segmentation fault

This was frustrating to debug as it was non-deterministic (even with println! logging, it turned out that it was not always crashing at the same place), but I found the problem using one of the sanitizers, following the instructions at https://github.com/japaric/rust-san:

for SAN in address leak memory thread; do
    export RUSTFLAGS=-Zsanitizer=$SAN RUSTDOCFLAGS=-Zsanitizer=$SAN
    RUST_BACKTRACE=full cargo +nightly run -Zbuild-std --target x86_64-apple-darwin --bin dump-tjs -- ../../gp-mbh/unabridged.pdf font-usage --phase phase1
done

(with a small 1-page PDF file as input). It seems the memory sanitizer doesn't work on this architecture, but the leak one showed:

<details> <summary>the output</summary> <p>

==35735==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 261568 byte(s) in 1 object(s) allocated from:
    #0 0x10e0f4368 in wrap_malloc+0x58 (librustc-nightly_rt.lsan.dylib:x86_64h+0x8368)
    #1 0x10d4eb168 in std::sys::unix::alloc::_$LT$impl$u20$core..alloc..global..GlobalAlloc$u20$for$u20$std..alloc..System$GT$::alloc::h470ffdc85d4d4861+0x88 (dump-tjs:x86_64+0x10054e168)
    #2 0x10d50c658 in __rdl_alloc+0x38 (dump-tjs:x86_64+0x10056f658)
    #3 0x10d054af6 in alloc::alloc::alloc::h3263eab0325b5109+0x36 (dump-tjs:x86_64+0x1000b7af6)
    #4 0x10d054b75 in alloc::alloc::Global::alloc_impl::hb2c40b5c858fb521+0x65 (dump-tjs:x86_64+0x1000b7b75)
    #5 0x10d054e5a in _$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h3de4dcca1921f5e9+0x1a (dump-tjs:x86_64+0x1000b7e5a)
    #6 0x10d054a68 in alloc::alloc::exchange_malloc::h84da2d93f524304d+0x38 (dump-tjs:x86_64+0x1000b7a68)
    #7 0x10d0507bf in pprof::collector::TempFdArray$LT$T$GT$::new::hea726b904867fdf6+0x19f (dump-tjs:x86_64+0x1000b37bf)
    #8 0x10d051550 in pprof::collector::Collector$LT$T$GT$::new::h04eb4a6cd7916669+0x30 (dump-tjs:x86_64+0x1000b4550)
    #9 0x10d05dba4 in pprof::profiler::Profiler::new::hec63829a994cc04a+0x24 (dump-tjs:x86_64+0x1000c0ba4)
    #10 0x10d055028 in core::ops::function::FnOnce::call_once::h32ec437d7c5003f7+0x18 (dump-tjs:x86_64+0x1000b8028)
    #11 0x10d051b11 in lazy_static::lazy::Lazy$LT$T$GT$::get::_$u7b$$u7b$closure$u7d$$u7d$::h6bc7232ade22f7df+0x21 (dump-tjs:x86_64+0x1000b4b11)
    #12 0x10d060f98 in std::sync::once::Once::call_once::_$u7b$$u7b$closure$u7d$$u7d$::h820e6dd8713b2457+0x38 (dump-tjs:x86_64+0x1000c3f98)
    #13 0x10d5d636d in std::sync::once::Once::call_inner::hf2527b5d031ff925+0x1ed (dump-tjs:x86_64+0x10063936d)
    #14 0x10d060f34 in std::sync::once::Once::call_once::h94db1d7650bb3e50+0x74 (dump-tjs:x86_64+0x1000c3f34)
    #15 0x10d05e54f in _$LT$pprof..profiler..PROFILER$u20$as$u20$core..ops..deref..Deref$GT$::deref::h06d87b55e2a28aa2+0x2f (dump-tjs:x86_64+0x1000c154f)
    #16 0x10d05cba5 in pprof::profiler::trigger_lazy::h3494183b22569d53+0x25 (dump-tjs:x86_64+0x1000bfba5)
    #17 0x10d05cbf6 in pprof::profiler::ProfilerGuard::new::h2acd95977bc6a75f+0x26 (dump-tjs:x86_64+0x1000bfbf6)
    #18 0x10d028be5 in dump_tjs::main::hfe8da54ef2505aab dump-tjs.rs:129
    #19 0x10d012a0d in core::ops::function::FnOnce::call_once::h67791a50b38bcfe2 function.rs:227
    #20 0x10cfd9830 in std::sys_common::backtrace::__rust_begin_short_backtrace::h4c9dad599b86be2a backtrace.rs:125
    #21 0x10cfcc173 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hce484e6ba64dde89 rt.rs:63
    #22 0x10d4fb2d2 in core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnOnce$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_once::h0c84ea5dfec15b1a+0x12 (dump-tjs:x86_64+0x10055e2d2)
    #23 0x10d533d69 in std::panicking::try::do_call::hb37852e00490aef1+0x39 (dump-tjs:x86_64+0x100596d69)
    #24 0x10d5352aa in __rust_try+0x2a (dump-tjs:x86_64+0x1005982aa)
    #25 0x10d533bb8 in std::panicking::try::ha93df4408ed6a17b+0x68 (dump-tjs:x86_64+0x100596bb8)
    #26 0x10d4a08cb in std::panic::catch_unwind::hb872d3ec40b93f64+0x1b (dump-tjs:x86_64+0x1005038cb)
    #27 0x10d4d903e in std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::h79004bbf634c4351+0x1e (dump-tjs:x86_64+0x10053c03e)
    #28 0x10d533cb9 in std::panicking::try::do_call::h5efaa05777cc4a5e+0x39 (dump-tjs:x86_64+0x100596cb9)
    #29 0x10d5352aa in __rust_try+0x2a (dump-tjs:x86_64+0x1005982aa)

Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x10e0f4368 in wrap_malloc+0x58 (librustc-nightly_rt.lsan.dylib:x86_64h+0x8368)
    #1 0x10d4eb168 in std::sys::unix::alloc::_$LT$impl$u20$core..alloc..global..GlobalAlloc$u20$for$u20$std..alloc..System$GT$::alloc::h470ffdc85d4d4861+0x88 (dump-tjs:x86_64+0x10054e168)
    #2 0x10d50c658 in __rdl_alloc+0x38 (dump-tjs:x86_64+0x10056f658)
    #3 0x10d054af6 in alloc::alloc::alloc::h3263eab0325b5109+0x36 (dump-tjs:x86_64+0x1000b7af6)
    #4 0x10d054b75 in alloc::alloc::Global::alloc_impl::hb2c40b5c858fb521+0x65 (dump-tjs:x86_64+0x1000b7b75)
    #5 0x10d054e5a in _$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h3de4dcca1921f5e9+0x1a (dump-tjs:x86_64+0x1000b7e5a)
    #6 0x10d054a68 in alloc::alloc::exchange_malloc::h84da2d93f524304d+0x38 (dump-tjs:x86_64+0x1000b7a68)
    #7 0x10d0500cf in _$LT$pprof..collector..StackHashCounter$LT$T$GT$$u20$as$u20$core..default..Default$GT$::default::h5dad465f06daaf7c+0x4f (dump-tjs:x86_64+0x1000b30cf)
    #8 0x10d05153d in pprof::collector::Collector$LT$T$GT$::new::h04eb4a6cd7916669+0x1d (dump-tjs:x86_64+0x1000b453d)
    #9 0x10d05dba4 in pprof::profiler::Profiler::new::hec63829a994cc04a+0x24 (dump-tjs:x86_64+0x1000c0ba4)
    #10 0x10d055028 in core::ops::function::FnOnce::call_once::h32ec437d7c5003f7+0x18 (dump-tjs:x86_64+0x1000b8028)
    #11 0x10d051b11 in lazy_static::lazy::Lazy$LT$T$GT$::get::_$u7b$$u7b$closure$u7d$$u7d$::h6bc7232ade22f7df+0x21 (dump-tjs:x86_64+0x1000b4b11)
    #12 0x10d060f98 in std::sync::once::Once::call_once::_$u7b$$u7b$closure$u7d$$u7d$::h820e6dd8713b2457+0x38 (dump-tjs:x86_64+0x1000c3f98)
    #13 0x10d5d636d in std::sync::once::Once::call_inner::hf2527b5d031ff925+0x1ed (dump-tjs:x86_64+0x10063936d)
    #14 0x10d060f34 in std::sync::once::Once::call_once::h94db1d7650bb3e50+0x74 (dump-tjs:x86_64+0x1000c3f34)
    #15 0x10d05e54f in _$LT$pprof..profiler..PROFILER$u20$as$u20$core..ops..deref..Deref$GT$::deref::h06d87b55e2a28aa2+0x2f (dump-tjs:x86_64+0x1000c154f)
    #16 0x10d05cba5 in pprof::profiler::trigger_lazy::h3494183b22569d53+0x25 (dump-tjs:x86_64+0x1000bfba5)
    #17 0x10d05cbf6 in pprof::profiler::ProfilerGuard::new::h2acd95977bc6a75f+0x26 (dump-tjs:x86_64+0x1000bfbf6)
    #18 0x10d028be5 in dump_tjs::main::hfe8da54ef2505aab dump-tjs.rs:129
    #19 0x10d012a0d in core::ops::function::FnOnce::call_once::h67791a50b38bcfe2 function.rs:227
    #20 0x10cfd9830 in std::sys_common::backtrace::__rust_begin_short_backtrace::h4c9dad599b86be2a backtrace.rs:125
    #21 0x10cfcc173 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hce484e6ba64dde89 rt.rs:63
    #22 0x10d4fb2d2 in core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnOnce$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_once::h0c84ea5dfec15b1a+0x12 (dump-tjs:x86_64+0x10055e2d2)
    #23 0x10d533d69 in std::panicking::try::do_call::hb37852e00490aef1+0x39 (dump-tjs:x86_64+0x100596d69)
    #24 0x10d5352aa in __rust_try+0x2a (dump-tjs:x86_64+0x1005982aa)
    #25 0x10d533bb8 in std::panicking::try::ha93df4408ed6a17b+0x68 (dump-tjs:x86_64+0x100596bb8)
    #26 0x10d4a08cb in std::panic::catch_unwind::hb872d3ec40b93f64+0x1b (dump-tjs:x86_64+0x1005038cb)
    #27 0x10d4d903e in std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::h79004bbf634c4351+0x1e (dump-tjs:x86_64+0x10053c03e)
    #28 0x10d533cb9 in std::panicking::try::do_call::h5efaa05777cc4a5e+0x39 (dump-tjs:x86_64+0x100596cb9)
    #29 0x10d5352aa in __rust_try+0x2a (dump-tjs:x86_64+0x1005982aa)

Indirect leak of 17563648 byte(s) in 4096 object(s) allocated from:
    #0 0x10e0f4368 in wrap_malloc+0x58 (librustc-nightly_rt.lsan.dylib:x86_64h+0x8368)
    #1 0x10d4eb168 in std::sys::unix::alloc::_$LT$impl$u20$core..alloc..global..GlobalAlloc$u20$for$u20$std..alloc..System$GT$::alloc::h470ffdc85d4d4861+0x88 (dump-tjs:x86_64+0x10054e168)
    #2 0x10d50c658 in __rdl_alloc+0x38 (dump-tjs:x86_64+0x10056f658)
    #3 0x10d054af6 in alloc::alloc::alloc::h3263eab0325b5109+0x36 (dump-tjs:x86_64+0x1000b7af6)
    #4 0x10d054b75 in alloc::alloc::Global::alloc_impl::hb2c40b5c858fb521+0x65 (dump-tjs:x86_64+0x1000b7b75)
    #5 0x10d054e5a in _$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h3de4dcca1921f5e9+0x1a (dump-tjs:x86_64+0x1000b7e5a)
    #6 0x10d054a68 in alloc::alloc::exchange_malloc::h84da2d93f524304d+0x38 (dump-tjs:x86_64+0x1000b7a68)
    #7 0x10d04f96f in _$LT$pprof..collector..Bucket$LT$T$GT$$u20$as$u20$core..default..Default$GT$::default::h28014b09fd497944+0x4f (dump-tjs:x86_64+0x1000b296f)
    #8 0x10d050178 in _$LT$pprof..collector..StackHashCounter$LT$T$GT$$u20$as$u20$core..default..Default$GT$::default::_$u7b$$u7b$closure$u7d$$u7d$::h78e2260d372dfc92+0x18 (dump-tjs:x86_64+0x1000b3178)
    #9 0x10d060dda in _$LT$core..slice..iter..IterMut$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::for_each::hcd72468f34694ebd+0x6a (dump-tjs:x86_64+0x1000c3dda)
    #10 0x10d05014e in _$LT$pprof..collector..StackHashCounter$LT$T$GT$$u20$as$u20$core..default..Default$GT$::default::h5dad465f06daaf7c+0xce (dump-tjs:x86_64+0x1000b314e)
    #11 0x10d05153d in pprof::collector::Collector$LT$T$GT$::new::h04eb4a6cd7916669+0x1d (dump-tjs:x86_64+0x1000b453d)
    #12 0x10d05dba4 in pprof::profiler::Profiler::new::hec63829a994cc04a+0x24 (dump-tjs:x86_64+0x1000c0ba4)
    #13 0x10d055028 in core::ops::function::FnOnce::call_once::h32ec437d7c5003f7+0x18 (dump-tjs:x86_64+0x1000b8028)
    #14 0x10d051b11 in lazy_static::lazy::Lazy$LT$T$GT$::get::_$u7b$$u7b$closure$u7d$$u7d$::h6bc7232ade22f7df+0x21 (dump-tjs:x86_64+0x1000b4b11)
    #15 0x10d060f98 in std::sync::once::Once::call_once::_$u7b$$u7b$closure$u7d$$u7d$::h820e6dd8713b2457+0x38 (dump-tjs:x86_64+0x1000c3f98)
    #16 0x10d5d636d in std::sync::once::Once::call_inner::hf2527b5d031ff925+0x1ed (dump-tjs:x86_64+0x10063936d)
    #17 0x10d060f34 in std::sync::once::Once::call_once::h94db1d7650bb3e50+0x74 (dump-tjs:x86_64+0x1000c3f34)
    #18 0x10d05e54f in _$LT$pprof..profiler..PROFILER$u20$as$u20$core..ops..deref..Deref$GT$::deref::h06d87b55e2a28aa2+0x2f (dump-tjs:x86_64+0x1000c154f)
    #19 0x10d05cba5 in pprof::profiler::trigger_lazy::h3494183b22569d53+0x25 (dump-tjs:x86_64+0x1000bfba5)
    #20 0x10d05cbf6 in pprof::profiler::ProfilerGuard::new::h2acd95977bc6a75f+0x26 (dump-tjs:x86_64+0x1000bfbf6)
    #21 0x10d028be5 in dump_tjs::main::hfe8da54ef2505aab dump-tjs.rs:129
    #22 0x10d012a0d in core::ops::function::FnOnce::call_once::h67791a50b38bcfe2 function.rs:227
    #23 0x10cfd9830 in std::sys_common::backtrace::__rust_begin_short_backtrace::h4c9dad599b86be2a backtrace.rs:125
    #24 0x10cfcc173 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hce484e6ba64dde89 rt.rs:63
    #25 0x10d4fb2d2 in core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnOnce$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_once::h0c84ea5dfec15b1a+0x12 (dump-tjs:x86_64+0x10055e2d2)
    #26 0x10d533d69 in std::panicking::try::do_call::hb37852e00490aef1+0x39 (dump-tjs:x86_64+0x100596d69)
    #27 0x10d5352aa in __rust_try+0x2a (dump-tjs:x86_64+0x1005982aa)
    #28 0x10d533bb8 in std::panicking::try::ha93df4408ed6a17b+0x68 (dump-tjs:x86_64+0x100596bb8)
    #29 0x10d4a08cb in std::panic::catch_unwind::hb872d3ec40b93f64+0x1b (dump-tjs:x86_64+0x1005038cb)

SUMMARY: LeakSanitizer: 17890752 byte(s) leaked in 4098 allocation(s).

</p> </details>

So I just removed the profiler (the only one I had found to work in the first place :-( ) and it seems to not crash anymore; give it a try.

(Haven't looked any further to find why this happens; will report it to https://github.com/tikv/pprof-rs and leave it at that.)

ujjvlh

comment created time in 4 days

push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha bd6fe8ddc3fb4a9c1e0f519d1eabe196bf257f66

(m) put the main parameter first

Shreevatsa R

commit sha 80a52246cb40a7110c4276e15e6ea51475eae943

may make this more efficient, need to check

Shreevatsa R

commit sha 3e8a65ef79a88c5249d50908835c3aef5452d38c

Finally found the cause of the segmentation fault #5

push time in 4 days

issue comment shreevatsa/site

tex-other: a couple of Go-based implementations

BTW I started a new discussion and mentioned your project at https://github.com/shreevatsa/webWEB/discussions/18 — some of the people there are interested in WEB and/or CWEB, so I'm sure they'll enjoy your work!

sbinet

comment created time in 8 days

issue closed shreevatsa/site

tex-other: a couple of Go-based implementations

hi,

I believe you are the main curator of github.com/tex-other. I may have 2 Go-based implementations to add to your curated list:

  • go-latex/latex (basically a translation of matplotlib's "mathtex" mode; not a complete TeX implementation)
  • star-tex (on SourceHut) (a complete TeX implementation, automatically transliterated from .WEB to Go)

disclaimer: I am the main guilty party behind those attempts.

closed time in 8 days

sbinet

issue comment shreevatsa/site

tex-other: a couple of Go-based implementations

Sorry for the delay, and thank you so much once again! Really interesting and I hope to study them in detail soon; for now I've grabbed copies of all three of these into github.com/tex-other :-)

Really cool to see the automatically generated Go code, especially. Thanks for sharing!

sbinet

comment created time in 8 days

created branch tex-other/web2go

branch : master

created branch time in 8 days

created repository tex-other/web2go

Fork of https://gitlab.com/cznic/web2go which was used to generate star-tex

created time in 8 days

push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha 3b97ad809f968444c661b98b86de62362d7588b5

'Inline' functions that were only called from one location.

push time in 9 days

push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha b70f5c95afb27fa4a3c4b183b55d61a386b9e854

'Inline' functions that were only called from one location.

push time in 9 days

started josch/img2pdf

started time in 11 days

started myollie/img2pdf

started time in 11 days

issue comment shreevatsa/site

tex-other: a couple of Go-based implementations

Wow that's great, thank you so much! I'll take a look and add it to the list when I next get some time at the computer…

For the star-tex, the "web2go" sounds interesting too… is that also available somewhere?

sbinet

comment created time in 23 days

issue comment shreevatsa/pdf-glyph-mapping

Extract Formatting

Ah ok, I hadn't noticed the italics... If there are, then yes, it must be as you say (without using a separate font). I'll take a look sometime.

On Sun, Aug 22, 2021, 9:56 PM उ॒ज्ज्व॒लः ***@***.***> wrote:

Are there italics on any page in the unabridged PDF we were looking at? Because I'd have thought that if so, like regular and bold, there would also be an italic font in the PDF.

All the sentences ending in उवाच, mentioning the speaker of the following verses. And the Hindi endings of adhyāyas. Are you sure that a separate font needs to be included, that faux italic (or bold) is not produced by the PDF rendering program?

ujjvlh

comment created time in 24 days

issue comment shreevatsa/pdf-glyph-mapping

Extract Formatting

Good idea, though it will be a while before I get to this… please remind me in three weeks if I've not replied again here by then. :-)

Are there italics on any page in the unabridged PDF we were looking at? Because I'd have thought that if so, like regular and bold, there would also be an italic font in the PDF.

ujjvlh

comment created time in a month

push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha 5d2abc79f3e5bcc200ad9000801498876d99845e

Add some background

push time in a month

push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha 5640cb1ef5302f272a9a92167ac6bfcf2642b5e2

removed from Drive so commit here

push time in a month

issue comment shreevatsa/pdf-glyph-mapping

Save some more of the intermediate results produced in dump-tjs

I think that's what the "phase1" was supposed to do (it just dumps the glyph ids used for each text operation): https://github.com/shreevatsa/pdf-glyph-mapping/blob/af4cf8ba6f152b7bc8e34ecafe436d8d60394c60/work/Makefile#L39-L41

But I guess we could make it better by splitting text per page (and maybe all fonts on that page together…), and replace the second run https://github.com/shreevatsa/pdf-glyph-mapping/blob/af4cf8ba6f152b7bc8e34ecafe436d8d60394c60/work/Makefile#L64-L66 with something that just works on the dumped sequences and generates corresponding text directly, so that we don't have to generate the PDF (which is very slow) and run pdftotext on it.

Then when the text seems satisfactory, generating the new PDF can be the last step.

ujjvlh

comment created time in a month

push event shreevatsa/pdf-glyph-mapping

ujjvlh

commit sha dd50eedda3db5da2c12e272610c8cb921e4b4b17

regex

ujjvlh

commit sha da729469fe9e2bced7e1133ee828486dc95878ee

text.py

ujjvlh

commit sha 80ffbdcec4056e1aaf28798c0083408e24121964

text.py

ujjvlh

commit sha 1e4b35e649fcc0bdf968d501a0a284e6e97b1f32

text.py

ujjvlh

commit sha 11cc4f8b87036901e36a1b7c2c6ce25b1e1235fe

text.py

Shreevatsa

commit sha af4cf8ba6f152b7bc8e34ecafe436d8d60394c60

Merge pull request #7 from ujjvlh/main regex

push time in a month

PR merged shreevatsa/pdf-glyph-mapping

regex

This covers all cases (at least in the mbh PDF) now.

+27 -2

0 comment

2 changed files

ujjvlh

pr closed time in a month

Pull request review comment shreevatsa/pdf-glyph-mapping

regex

+import pdftotext
+import re
+
+i = 0
+
+s = []
+
+for pg in pdftotext.PDF(open('../../gp-mbh/unabridged.fixed.pdf', 'rb')):
+  a = ''
+  c = pg
+  while a != c:
+    a = c
+    b = re.sub(r'(.)<CCsucc>(([क-हक़-य़]़?्)*[क-हक़-य़]़?)', r'\2\1', a)
+    c = re.sub(r'(([क-हक़-य़]़?्)*[क-हक़-य़ऋ][^क-हक़-य़ऋ]*)र्<CCprec>', r'र्\1', b)

Just out of curiosity: In what kind of words is "ऋ" needed here? निरृति ?

ujjvlh

comment created time in a month
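
The fixed-point substitution loop in the reviewed diff above can be sketched in isolation. This is a minimal sketch under stated assumptions, not the repository's text.py. The function name reorder and the sample strings are made up for illustration, and the character classes are spelled with \u escapes instead of literal Devanagari. It also assumes, from the shape of the two patterns, that <CCsucc> marks a sign that belongs after the following consonant cluster and that <CCprec> marks a र् that belongs before the preceding cluster.

```python
import re

# Consonants: U+0915-U+0939 plus U+0958-U+095F, mirroring the class in the diff.
CONS = "\u0915-\u0939\u0958-\u095F"
NUKTA = "\u093C"
VIRAMA = "\u094D"
VOCALIC_R = "\u090B"  # the independent vowel the review comment asks about
RA = "\u0930"

# Zero or more "consonant (+ nukta) + virama" units that open a conjunct.
PREFIX = f"([{CONS}]{NUKTA}?{VIRAMA})*"

# (any char)<CCsucc>(cluster)  ->  cluster followed by that char
SUCC = re.compile(f"(.)<CCsucc>({PREFIX}[{CONS}]{NUKTA}?)")

# (cluster or vocalic r, plus trailing signs) RA VIRAMA <CCprec>
#   ->  RA VIRAMA moved in front of that cluster
PREC = re.compile(
    f"({PREFIX}[{CONS}{VOCALIC_R}][^{CONS}{VOCALIC_R}]*){RA}{VIRAMA}<CCprec>"
)


def reorder(text):
    """Apply both substitutions repeatedly until the text stops changing."""
    prev = None
    while prev != text:
        prev = text
        text = SUCC.sub(r"\2\1", text)
        text = PREC.sub(RA + VIRAMA + r"\1", text)
    return text


if __name__ == "__main__":
    # Hypothetical inputs: a vowel sign emitted before its consonant,
    # and a RA+VIRAMA emitted after the consonant it should precede.
    print(reorder("\u093F<CCsucc>\u0915"))
    print(reorder("\u0927\u092E\u0930\u094D<CCprec>"))
```

Each pass handles one marker per cluster, and a page can contain many markers, so the loop repeats until a pass leaves the text unchanged, as the while a != c loop in the diff does.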

Pull request review comment shreevatsa/pdf-glyph-mapping

regex

 def recurse(i, cur):
   def normalize(r):

This function is not actually being called so we could even just remove this I guess (to avoid having it in two places).

ujjvlh

comment created time in a month


push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha 7deb635a7d64bea59162b103c3ff924a033cc512

Check in the file I used to generate the spreadsheet, just in case it's useful later

Shreevatsa R

commit sha 02a82c3058d62aeb11e1df2a807cf04d667fed8a

some special cases

Shreevatsa R

commit sha 17393f277d345e1de99f003d26b2a744d0f4d5f6

Allow multiple font files

Shreevatsa R

commit sha 445e39adfa6e2b9aa212925d8f3c32fe488b71c2

Try going back to older version... doesn't seem to help

Shreevatsa R

commit sha d33418661fe4477bf5192da359402938d014caaa

commit these, to unblock others

Shreevatsa R

commit sha d056c01dd9cf2e277d883f49e39b3ae0225533d6

save this too, just for future study

push time in a month

issue comment shreevatsa/pdf-glyph-mapping

Segmentation fault

I've downloaded the file, but meanwhile I've started getting a segmentation fault too, even with the file I was using earlier... Will take a look sometime (I think there's a way to run Rust programs with ASan which may point to the problem). How did you narrow it down to that line of code earlier?

On Sun, Aug 15, 2021, 8:53 PM उ॒ज्ज्व॒लः ***@***.***> wrote:

I'll download the file and try it when I have access,

Sorry. Just changed the view permissions to "anyone with the link". Will now try with the updated code.

ujjvlh

comment created time in a month

issue comment shreevatsa/pdf-glyph-mapping

cargo invocation problem

I'll take a look later, but you should be able to remove the "+nightly". It was needed only for more detailed errors.

(Or just use the spreadsheet I just posted to the group, without bothering with this repo at all)

On Sun, Aug 15, 2021, 8:27 PM Vishvas Vasuki विश्वासः < ***@***.***> wrote:

[vvasuki:~/shreevatsa/pdf-glyph-mapping/work:main]─[08:54:13]─$make

RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- /home/vvasuki/Documents/books/granthasangrahaH/purANam/unabridged_full.pdf font-usage/ --phase phase1

error: no such subcommand: +nightly

make: *** [Makefile:41: font-usage/] Error 101

vvasuki

comment created time in a month

push event shreevatsa/pdf-glyph-mapping

Shreevatsa R

commit sha d42a28aefd5eb3b37279faa55f95e84bc8c73ad2

ok it's not THAT slow; I was running the debug build by mistake

push time in a month

issue comment shreevatsa/pdf-glyph-mapping

Segmentation fault

I'll download the file and try it when I have access, but I also made some changes to the code and the problem may have gone away now (though I can't see why); please try that too.

I was able to run the current version on the entire unabridged PDF from the Internet Archive without errors (earlier it wouldn't have worked and required the qpdf pass as in #3 but now it works on the file directly).

ujjvlh

comment created time in a month