Cesar Eduardo Barros (cesarb), Rio de Janeiro

cesarb/blake2-rfc 57

A pure Rust implementation of BLAKE2 based on RFC 7693.

cesarb/clear_on_drop 43

Helpers for clearing sensitive data on the stack and heap

cesarb/chacha20-poly1305-aead 17

A pure Rust implementation of the ChaCha20-Poly1305 AEAD from RFC 7539.

cesarb/constant_time_eq 14

Compares two equal-sized byte strings in constant time.
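The technique behind the crate can be sketched in a few lines, assuming the usual XOR-accumulate approach (an illustration of the idea, not the crate's actual code):

```rust
// Illustration of constant-time comparison (not constant_time_eq's actual
// code): accumulate differences with OR so the loop's duration and branches
// never depend on where the inputs differ.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    assert_eq!(a.len(), b.len());
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"secret", b"secret"));
    assert!(!ct_eq(b"secret", b"secreT"));
}
```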

cesarb/android-cellid 3

Simple Android application to show the current GSM cell ID

cesarb/filestatrec 3

record mtime and mode for files in a git repository

cesarb/blake2-bench 1

Comparative benchmarking for Rust implementations of the BLAKE2 hash functions

cesarb/glxswapbuffersmeasure 1

measure latency between glXSwapBuffers calls

cesarb/packageaddedremovednotifier 1

Notifies and logs packages added/removed on your Android phone or tablet

cesarb/ripgrep 1

ripgrep combines the usability of The Silver Searcher with the raw speed of grep.

push event cesarb/BLAKE3

Cesar Eduardo Barros

commit sha 62a7db2951f5e2419bbb240c186d0a2feabed724

Use non-temporal copy for input buffer

view details

push time in a month

pull request comment BLAKE3-team/BLAKE3

Vulkan implementation of b3sum

I guess I finally found what I was doing wrong: my laptop CPU is too fast. Trying on an older laptop with integrated GPU (Haswell), with a pair of files ~750M total added twice to the command line (so 4 files with ~1500M total), it took 0.6s without --vulkan, but only 0.5s with --vulkan. The hash is correct even though Mesa warns that "Haswell Vulkan support is incomplete".

So it seems the result is mixed; depending on your CPU and GPU, either could be the faster one.

cesarb

comment created time in a month

issue comment rust-lang/rust

asm!: `options` should work multiple times, to simplify macros

with the semantic that no conflicting options may be provided. (We could also do "last option wins", but I think I'd slightly prefer "don't specify conflicting options" instead.)

I think a simpler semantic would be "multiple options is identical to a single options with the options concatenated". That is, options(A, B), options(C, D) would have the exact same semantics as options(A, B, C, D).

This would be good not only for macros, but also for readability (some people might prefer multiple options, one on each line, instead of being forced to use a single options, when the list of options starts to get too long).

joshtriplett

comment created time in a month

pull request comment rust-lang/rust

Allow unused arguments in asm!

So, I'd personally prefer not to extend the syntax until a case for which asm!("/* {} */", arg); doesn't work is discovered in practice.

You cannot guarantee that a future assembly syntax for a future ISA will not use /* for a non-comment purpose. That is, if you depend on comments to fool the "unused arguments" error, you need a comment syntax that is guaranteed to be the same for every current or future ISA.

That said, black_box and similar are "special": any other use of inline assembly will have actual assembly code, and therefore be specific to a single ISA or ISA family, so the writer of that inline assembly will know the correct comment syntax. Therefore, another option would be to add a compiler intrinsic for black_box, but that compiler intrinsic would have to be flexible enough for all optimization barrier use cases.

There are at least two main cases that I can think of: register value (asm!("", inlateout(reg) value, options(pure, nomem, nostack, preserves_flags)), which could be used for instance for the constant_time_eq crate) and memory reference (asm!("", in(reg) &mut value, options(nostack, preserves_flags))), but there might be others (the clear_on_drop crate, for instance, has one variant for sized and one variant for unsized values, though they would be the same with the new asm! since it doesn't have memory operands yet). With memory operands, one could have a third main case which is a hybrid of the other two, something like asm!("", inout(mem) &mut value, options(pure, nostack, preserves_flags)).

So I think that the simplest and most flexible approach is to just allow arguments to be missing from the template string, but still be treated as used by the compiler, either through a lint which can be allowed, or through a special named argument or some other syntax trickery. There is a lot of precedent for that (not only the Rust black_box function, but also the Linux kernel's RELOC_HIDE, OPTIMIZER_HIDE_VAR, and barrier/barrier_data macros).
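For reference, the "register value" barrier case described above can be written with today's asm! (which was later stabilized in Rust 1.59), using the /* {0} */ comment trick to mark the operand as used; a sketch assuming x86-64 GAS comment syntax, not the final form discussed in this thread:

```rust
use std::arch::asm;

// Sketch of a register-value optimization barrier: the empty-ish template
// does nothing at runtime, but the inlateout operand forces the compiler
// to materialize the value in a register and treat it as rewritten.
#[cfg(target_arch = "x86_64")]
fn barrier(mut value: u64) -> u64 {
    unsafe {
        // `/* {0} */` references the operand inside an assembler comment,
        // avoiding the "argument never used" error.
        asm!("/* {0} */", inlateout(reg) value,
             options(pure, nomem, nostack, preserves_flags));
    }
    value
}

// Fallback so the sketch still compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn barrier(value: u64) -> u64 {
    value
}

fn main() {
    assert_eq!(barrier(42), 42);
}
```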

Amanieu

comment created time in a month

issue comment cesarb/clear_on_drop

Error compiling with latest nightly - asm syntax outdated

I changed it to use llvm_asm! (unfortunately, the new asm! cannot be used due to https://github.com/rust-lang/rust/issues/72965) and released 0.2.4; could you check if it works for you, and if it does, close this issue?

dunnock

comment created time in a month

PR closed cesarb/clear_on_drop

Update hide.rs from asm! to llvm_asm!

asm! is now deprecated in nightly Rust and produces errors when compiling.

I have changed two lines to the newer version, llvm_asm!, so the crate can compile.

Can you release the next version (0.2.4) as soon as possible so this fix is available?

+2 -2

1 comment

1 changed file

0xSilene

pr closed time in a month

pull request comment cesarb/clear_on_drop

Update hide.rs from asm! to llvm_asm!

This pull request failed CI (missing the correct #![feature]), I did it by hand on master.

0xSilene

comment created time in a month

push event cesarb/clear_on_drop

Cesar Eduardo Barros

commit sha ca172eb253a6169f92b08f9136732da8d36b61c1

Release 0.2.4

view details

push time in a month

created tag cesarb/clear_on_drop

tag 0.2.4

Helpers for clearing sensitive data on the stack and heap

created time in a month

push event cesarb/clear_on_drop

Cesar Eduardo Barros

commit sha ce59e69b91387eb4fefe542e6d5dcca0a7ace942

Use old asm! syntax

The new asm! syntax is currently unable to do this trick, see https://github.com/rust-lang/rust/issues/72965 for details.

view details

push time in a month

issue opened rust-lang/rust

"argument never used" in new asm! syntax

When trying to use the new asm! syntax to implement something similar to std::hint::black_box, it fails with an "argument never used" error. For example (https://godbolt.org/z/mG9AMg):

#![feature(asm)]

pub fn black_box(dummy: i32) -> i32 {
    unsafe {
        asm!("", in(reg) &dummy, options(nostack));
        dummy
    }
}
error: argument never used
 --> <source>:5:18
  |
5 |         asm!("", in(reg) &dummy, options(nostack));
  |                  ^^^^^^^^^^^^^^ argument never used

error: aborting due to previous error

Compiler returned: 1

This makes it impossible to use the new asm! syntax in one of my crates (https://github.com/cesarb/clear_on_drop/commit/43cd2be18076c424437fbdc302bec22853291b27 with CI output at https://travis-ci.org/github/cesarb/clear_on_drop/jobs/694467164).

created time in a month

push event cesarb/clear_on_drop

Cesar Eduardo Barros

commit sha 43cd2be18076c424437fbdc302bec22853291b27

Attempt to use new asm syntax

view details

push time in a month

push event cesarb/clear_on_drop

Cesar Eduardo Barros

commit sha aa4c96a39d0768471749a41265e9601839e33950

Update minimum Rust to 1.31.1

The cc crate does not work with earlier Rust anymore.

view details

push time in a month

issue comment cesarb/clear_on_drop

Error compiling with latest nightly - asm syntax outdated

That is a strange one. Both asm!() are within a #[cfg(feature = "nightly")], so they should not break unless you set the nightly feature for this crate, but from what I could see you are not doing that. I triggered a CI run of this crate to see if after https://github.com/rust-lang/rust/issues/70173 it breaks even if the asm!() is inside a #[cfg]'ed out block, but as you can see at https://travis-ci.org/github/cesarb/clear_on_drop/builds/693501576 it breaks only when the nightly feature is being used (as expected), and compiles fine without the nightly feature. I have no idea why your build is breaking, unless the dependency is coming from somewhere I didn't look, and that place is setting the nightly feature for this crate.

dunnock

comment created time in a month

push event cesarb/clear_on_drop

Cesar Eduardo Barros

commit sha b4986e55f1b2b667d8df8c05540c552f953e1f77

Trigger CI rebuild

view details

push time in a month

pull request comment BLAKE3-team/BLAKE3

Vulkan implementation of b3sum

Also, since I'm leaving my informal benchmark results here: with --vulkan, it takes 1.4s independent of the --num-threads; without --vulkan, it takes 0.66s for --num-threads 8, 0.77s for --num-threads 4, 1.4s for --num-threads 2, and 2.6s for --num-threads 1. So the Vulkan code ties with two CPU threads, and wins over a single CPU thread.

cesarb

comment created time in 2 months

pull request comment BLAKE3-team/BLAKE3

Vulkan implementation of b3sum

Yes, it would be helpful. I'm not too hopeful it will end up being faster, but it doesn't hurt to try.

First, you should run the b3sum tests with Vulkan enabled (cargo test --features=vulkan), to make sure I didn't do anything which works only on my integrated GPU; if you have the Vulkan SDK installed, you should also do it with the validation layer enabled (VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation cargo test --features=vulkan; by default, Vulkan doesn't check for invalid uses of its API, that's done by the validation layer).

Then, you can compare the speed of b3sum on a large file which fits in the RAM disk cache (I used CentOS-8.1.1911-x86_64-dvd1.iso), with and without --vulkan. Here, I'm getting around 1.4s with --vulkan, and only 0.66s without --vulkan, on an i5-8250U (4 cores, 8 threads).

cesarb

comment created time in 2 months

pull request comment BLAKE3-team/BLAKE3

Vulkan implementation of b3sum

So to try to make this faster, I made several changes:

  1. I changed from the vulkano crate to the lower-level ash crate, which gave me more control over the exact Vulkan calls being made, without having to fight the higher-level abstractions all the time;
  2. I changed the dispatch loop to completely avoid pipeline barriers, through the use of software pipelining and push constants for the control data;
  3. I made the chunk shader do the input endian conversion, and did the output endian conversion on the CPU (note that this code is still untested, since I have no big-endian machine with a modern GPU);
  4. I changed the dispatch loop to do the final hashing of the parents in parallel with the GPU starting the next unit of work;
  5. I made the Vulkan code also use memmap and copy from the memory-mapped file, instead of using a read system call, which allowed making the dispatch loop much simpler (since it no longer has to read back the tail of the file from the GPU input buffer);
  6. I changed the input buffer from host cached to device local, so that on a discrete GPU, the data is copied from the memory-mapped file through the PCIe bus to the GPU, instead of being copied from the memory-mapped file to a buffer in the CPU memory and then read through the PCIe bus by the GPU (this makes no difference on an integrated GPU, which has a single memory type).

Unfortunately, on my integrated GPU it's still slower than hashing directly on the CPU, except when using a single thread (--num-threads 1). I used VK_KHR_pipeline_executable_properties to peek at the generated shader executable, and saw no obvious issues there (no registers spills, the code looks sane).

But I think I now know the issue. If I comment out the copy_from_slice from the memory-mapped file to the shader input buffer, most of the performance loss goes away (and the result obviously becomes incorrect). It seems that the performance is being limited by the memory bandwidth; since the GPU is only allowed to read from buffers allocated specifically for the GPU, I have to copy from the memory-mapped file to the memory allocated for the GPU, while the CPU can read directly from the memory-mapped file.

That is, it's reading twice from memory (and writing once), instead of just reading once. Things might be better with a discrete GPU, since it would read only once from memory, write through the PCIe bus (which is separate from the CPU memory bus), and then read again on the GPU (which has faster memory, and is also separate from the CPU memory bus); of course, that would depend on how fast the PCIe bus and GPU memory is. Unfortunately, I don't have at the moment a device with a discrete GPU to test and see how it works.

cesarb

comment created time in 2 months

create branch cesarb/BLAKE3

branch : vulkan-dump-pipelines

created branch time in 2 months

push event cesarb/BLAKE3

Cesar Eduardo Barros

commit sha e209a22fb1c8faa944208d24d6e613adc69362a5

Use mmap for Vulkan

view details

Cesar Eduardo Barros

commit sha ddacfc6ffa3e3e60022449469734415a46619b27

Use DEVICE_LOCAL memory type for input buffers

On a discrete GPU, this should allow writing directly from the mmap to the device through the PCIe bus, instead of writing to a memory buffer and letting the device read it through the PCIe bus. On an integrated GPU, this change has no effect, since there's only one memory type.

view details

push time in 2 months

push event cesarb/BLAKE3

Cesar Eduardo Barros

commit sha a51087f68930e1b9e1347c580431f7325b97dbc1

Use ash instead of vulkano

And rewrite the gpu hashing loop to avoid pipeline barriers.

view details

push time in 2 months

Pull request review comment BLAKE3-team/BLAKE3

implement `Hasher` trait for `Hasher`

 impl Hasher {
     }
 }
+impl hash::Hasher for Hasher {
+    fn finish(&self) -> u64 {
+        let mut bytes = [0; 8];
+
+        self.finalize_xof().fill(&mut bytes);
+
+        u64::from_be_bytes(bytes)

Since BLAKE3 is inherently little-endian, it would make more sense to use u64::from_le_bytes(bytes) here; on most common processors (which are all little-endian), it's also faster (which is why BLAKE3 is little-endian in the first place).
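The difference is easy to see with the standard library's conversion functions: the same bytes decode to different values depending on which end is treated as most significant.

```rust
fn main() {
    let bytes = [1u8, 0, 0, 0, 0, 0, 0, 0];
    // Little-endian: the first byte is the least significant.
    assert_eq!(u64::from_le_bytes(bytes), 1);
    // Big-endian: the first byte is the most significant.
    assert_eq!(u64::from_be_bytes(bytes), 1 << 56);
}
```

On a little-endian CPU, from_le_bytes is a plain load, while from_be_bytes needs a byte swap.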

Luro02

comment created time in 2 months

push event cesarb/BLAKE3

Cesar Eduardo Barros

commit sha 7b69c23135b7a3733701058061b5f363c2f8636f

Simplify shader control

view details

push time in 2 months

push event cesarb/BLAKE3

Cesar Eduardo Barros

commit sha 860b6f04951dc95631adc2bd5524873c3a3a9e86

Reduce branching in shaders

view details

Cesar Eduardo Barros

commit sha a7f37373fb9735aedf100294bc2593ffc632ab5f

Reduce dependency on vulkano

view details

Cesar Eduardo Barros

commit sha 35312e92bb7ef60f78338c14e0a4ed12c6ac69d3

Simplify shader endian conversion

view details

push time in 2 months

issue comment BLAKE3-team/BLAKE3

Support re-creating Hasher state

I would like to pause and resume the hashing of a long-lived sequence of events that are collected over time by saving a Hasher instance to a database.

Wait, let's go back a step. Pausing and resuming the hashing of a long-lived sequence is nothing more than a special case of incremental hashing (https://github.com/BLAKE3-team/BLAKE3/issues/82). That is, by saving some of the intermediate tree nodes, plus the incomplete last chunk (up to 1023 bytes), one could use the incremental hashing API (still to be defined, currently the undocumented guts module) to extend the hash with new data (and generate the new intermediate nodes/partial chunk to be saved).

Using the incremental hashing mode that way would be much more robust than attempting to serialize and deserialize the internal Hasher state, and would be compatible with every implementation of blake3 which supported incremental hashing, even ones yet to be written.

That3Percent

comment created time in 2 months

push event cesarb/simpledvr

Cesar Eduardo Barros

commit sha 6d8e7e14c4fc778fd1a977a859fb2b12e0cbac76

Small fixes

view details

push time in 2 months

issue comment BLAKE3-team/BLAKE3

Use of incremental/chunk-based verification

We'd need a new parameter somewhere to say "this isn't the root, don't finalize it," but otherwise all the code is reusable without any changes.

I'd simply add a new finalize variant: besides finalize(&self) -> Hash and finalize_xof(&self) -> OutputReader, I'd add something like finalize_subtree(&self) -> Hash. The only question is how to prevent misuse; as the BLAKE3 paper says, setting the ROOT flag is important, but this function would make it easy to not set that flag. Perhaps doing something like what I did in my experimental Vulkan branch, which is to have an alternative LessSafeHasher and expose that particular finalize_subtree variant only on that hasher? That way, normal users of Hasher would not be tempted to call that variant, only the ones who need the incremental hashing would have to be careful.

17dec

comment created time in 3 months

issue comment BLAKE3-team/BLAKE3

Use of incremental/chunk-based verification

I would recommend against chunk-level (1024-byte) granularity. Having 16-chunk (16384-byte) granularity instead not only divides the space used by the metadata by 16, but also makes sure you can always use the most accelerated SIMD path, even with large 512-bit vectors like AVX-512. If you want to future-proof against 1024-bit vectors, you could even go to 32-chunk (32768-byte) granularity.

Above this, it's a trade-off between the metadata overhead and how much needs to be transferred before being verified. (You might also want bigger blocks if you want to take advantage of multiple threads, for instance a 64-chunk (65536-byte) block could use 4 threads with AVX-512.)
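The metadata savings are simple arithmetic. Assuming one 32-byte hash per verification block (an illustration of the trade-off, not code from this thread), for a 1 GiB file:

```rust
// Metadata overhead: one 32-byte hash per verification block
// (illustrative only; block size is the granularity being discussed).
fn overhead(total_bytes: u64, block_bytes: u64) -> u64 {
    ((total_bytes + block_bytes - 1) / block_bytes) * 32
}

fn main() {
    let file: u64 = 1 << 30; // 1 GiB
    assert_eq!(overhead(file, 1024), 32 << 20);  // 1-chunk blocks: 32 MiB of hashes
    assert_eq!(overhead(file, 16384), 2 << 20);  // 16-chunk blocks: 2 MiB, 16x less
}
```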

17dec

comment created time in 3 months

issue comment BLAKE3-team/BLAKE3

Use of incremental/chunk-based verification

but I suspect a lot of filesystems do something similar to this internally

Most filesystems don't even care about the hash of the data, only its location on the disk. Those that do care about the hash (as an additional integrity check, the main identifier is still the location on the disk) store the hash of each block/extent/whatever separately in the metadata, and each hash is fully recomputed whenever the corresponding block/extent/whatever changes (there is no incremental update). There is no tree structure of the hashes, only of the locations on the disk (and sometimes not even that; FAT32 and its relatives use a linked list).

17dec

comment created time in 3 months

pull request comment BLAKE3-team/BLAKE3

allow for construction of hash in const fn context

There's already Hash::from([0; 32]), so creating a Hash directly has always been allowed (and I see no reason why it would be discouraged). It seems trait fns cannot be const, but since this is already allowed, I would recommend naming it simply Hash::new(...).

inanna-malick

comment created time in 3 months
