BLAKE3-team/BLAKE3 1900

official implementations of the BLAKE3 cryptographic hash function

BLAKE3-team/BLAKE3-specs 102

The BLAKE3 paper: specifications, analysis, and design rationale

sneves/blake2-avx2 21

BLAKE2 AVX2 implementations

sneves/norx-rust 16

NORX implementation in Rust

sneves/chacha-avx2 14

AVX2 Chacha implementation

sneves/avx512-utils 11

AVX-512 utilities

noloader/BLAKE2 0

BLAKE2 official implementations

noloader/libb2 0

C library providing BLAKE2b, BLAKE2s, BLAKE2bp, BLAKE2sp

issue commentP-H-C/phc-winner-argon2

Issue regarding Variable Hash Output less then or equal to 64 bytes/Characters

the blake_state buffer shows the last 4 bytes of the input buffer and not the outlen as stated by the specification.

This is expected. Internally, BLAKE2b keeps a 128-byte input buffer, and each time the buffer fills up it is compressed into a new chaining value. So after 1028 bytes, the input buffer contains only the last 1028 % 128 = 4 bytes of in.

It would, perhaps, be more useful if you would describe what you're doing.
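The buffering arithmetic in this comment can be checked with a small model. This is only a sketch: blake2b_buffered_bytes is a hypothetical helper, not part of any BLAKE2 or Argon2 API, and it assumes that a full final block stays in the buffer until more input or finalization arrives, which is how the reference blake2b_update appears to behave.

```c
#include <stddef.h>

#define BLAKE2B_BLOCKBYTES 128

/* Hypothetical helper: number of bytes sitting in the input buffer after
   `total` bytes have been absorbed. Assumption: a complete block is only
   compressed once further input arrives, so a buffer that is exactly full
   stays full instead of dropping to zero. */
static size_t blake2b_buffered_bytes(size_t total) {
  if (total == 0) {
    return 0;
  }
  return (total - 1) % BLAKE2B_BLOCKBYTES + 1;
}
```

For the 1028-byte input discussed here the model gives 4, the same as 1028 % 128. The two only differ when the total is an exact multiple of 128.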

ithompson1959

comment created time in 9 days

push eventBLAKE2/BLAKE2

Jeffrey Walton

commit sha 087dd6dec0495ff835441cba63787db683a11b52

Add Aarch64 makefile

Samuel Neves

commit sha b52178a376ca85a8ffe50492263c2a5bc0fa4f46

Merge pull request #67 from noloader/aarch64 Add Aarch64 makefile

push time in 13 days

PR merged BLAKE2/BLAKE2

Add Aarch64 makefile

This commit adds an Aarch64 makefile.

It is a copy/paste of the original makefile, less the ARMv7 arch settings.

+40 -1

0 comment

2 changed files

noloader

pr closed time in 13 days

push eventBLAKE2/libb2

Jeffrey Walton

commit sha 68712f41ceb33b44b025b4e0c01662a81f3bea91

Add Travis testing

Samuel Neves

commit sha a036711671cc3bf35218ee659dfae0ba743ac45d

Merge pull request #35 from noloader/travis Add Travis testing

push time in 13 days

PR merged BLAKE2/libb2

Add Travis testing

This commit adds Travis testing.

A sample test report can be found here.

Commits and merges will trigger a CI build for those people who integrate Travis. If BLAKE2 does not have a Travis account, then the tests will not run. It is no big deal because the tests will run on forks of libb2.

@sneves, you may want to run a personal fork of libb2 and enable Travis on it. This config makes no changes to BLAKE2/libb2, while enabling CI testing for your fork. I use a similar setup for Crypto++. Wei does not have Travis integration, but my personal testing fork has Travis integration.

+342 -0

0 comment

1 changed file

noloader

pr closed time in 13 days

push eventBLAKE2/BLAKE2

Jeffrey Walton

commit sha a9facd90510213144b76ddacf070da4cfb55c0d6

Add Travis testing

Samuel Neves

commit sha 5587e70c2285dfbda4b74a44c87015b2484e0559

Merge pull request #66 from noloader/travis Add Travis testing

push time in 13 days

PR merged BLAKE2/BLAKE2

Add Travis testing

This commit adds a Travis testing configuration.

Sample testing results can be found here.

The Travis testing includes the following on Ubuntu Linux VMs:

  • GCC, amd64, ref
  • GCC, amd64, sse
  • GCC, aarch64, ref
  • GCC, aarch64, neon
  • GCC, ppc64le, ref
  • GCC, ppc64le, power8
  • GCC, s390x, ref
  • Clang, amd64, ref
  • Clang, amd64, sse
  • Clang, aarch64, ref
  • Clang, aarch64, neon
  • Clang, ppc64le, ref
  • Clang, ppc64le, power8
  • Clang, s390x, ref

Testing also includes the following on OS X 10.13 VMs:

  • Clang, amd64, ref
  • Clang, amd64, sse

Commits and merges will trigger a CI build for those people who integrate Travis. If BLAKE2 does not have a Travis account, then the tests will not run. It is no big deal because the tests will run on forks of BLAKE2.

@sneves, you may want to run a personal fork of BLAKE2 and enable Travis on it. This config makes no changes to BLAKE2/BLAKE2, while enabling CI testing for your fork. I use a similar setup for Crypto++. Wei does not have Travis integration, but my personal testing fork has Travis integration.

+231 -0

0 comment

1 changed file

noloader

pr closed time in 13 days

push eventBLAKE2/libb2

Samuel Neves

commit sha 186ffdfcde580b609ed46af10553924408bed878

x86_64 always has at least SSE2

push time in 14 days

pull request commentBLAKE2/libb2

Add GCC constructor for get_cpu_features

According to the C standard, an enumeration is compatible with an implementation-defined integer type that can represent all of its values. Hence, I'm doubtful that any 32-bit architecture will turn cpu_feature_t into a multi-word value. The default is int.

You can make sure of this by doing the following after the declaration of enum cpu_feature_t:

enum { BLAKE2_STATIC_ASSERT_1 = 1 / (sizeof(cpu_feature_t) <= sizeof(int)) };
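
A self-contained sketch of the same compile-time check follows. The names example_feature_t, EXAMPLE_STATIC_ASSERT_1 and example_feature_fits_in_int are hypothetical stand-ins, not libb2 identifiers. The trick relies on the fact that an enumerator must be an integer constant expression: if the comparison is false, the divisor is 0 and the compiler rejects the declaration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's cpu_feature_t. */
typedef enum {
  EXAMPLE_FEATURE_NONE = 0,
  EXAMPLE_FEATURE_A = 1 << 0,
  EXAMPLE_FEATURE_B = 1 << 1
} example_feature_t;

/* Compiles only if the enumeration is no wider than int: a false
   comparison evaluates to 0 and 1 / 0 is not a valid constant. */
enum {
  EXAMPLE_STATIC_ASSERT_1 = 1 / (sizeof(example_feature_t) <= sizeof(int))
};

/* Exposes the constant so the check can also be observed at run time. */
static int example_feature_fits_in_int(void) {
  return EXAMPLE_STATIC_ASSERT_1;
}
```

On C11 and later compilers, _Static_assert expresses the same check directly; the division form also works on older compilers.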
noloader

comment created time in 15 days

push eventBLAKE2/BLAKE2

Samuel Neves

commit sha d4615d847376dfdd2243d92de9592bdaecc1c5e6

update README.md

push time in 16 days

push eventBLAKE2/libb2

Jeffrey Walton

commit sha c4283b3469f94d6e209b5f22a4cea6ba1f27d8a8

Fix missing xgetbv on downlevel compilers

Samuel Neves

commit sha 37e4b4b80217ee52aa4bd144c1f4a2ad0c8c382d

Merge pull request #33 from noloader/cpuid Fix missing xgetbv on downlevel compilers

push time in 16 days

PR merged BLAKE2/libb2

Fix missing xgetbv on downlevel compilers

Also see http://www.agner.org/optimize/vectorclass/read.php?i=65.

+3 -1

1 comment

1 changed file

noloader

pr closed time in 16 days

pull request commentBLAKE2/libb2

Fix missing xgetbv on downlevel compilers

I doubt that any compiler that wouldn't recognize xgetbv would be able to compile any of the intrinsics code, but there's no harm in doing this.

noloader

comment created time in 16 days

pull request commentBLAKE2/BLAKE2

Add BLAKE2b for POWER8

Alright, merged. Thanks for the code!

noloader

comment created time in 17 days

PR merged BLAKE2/BLAKE2

Add BLAKE2b for POWER8

This commit adds BLAKE2b for POWER8.

The self tests were OK on gcc112.fsffrance.org (Linux, ppc64le) and gcc119.fsffrance.org (AIX, ppc64be) using GCC and XLC compilers.

The POWER8 code is anywhere from 12 cpb faster to 2 cpb slower than the C code, depending on the architecture, the compiler, and the CPU power settings. The BLAKE2b POWER8 code is about 2 to 2.5 cpb faster than MD5 based on the benchmarks I have access to.

The BLAKE2s gear is from the reference implementation. I don't plan on a BLAKE2s Altivec implementation. The Altivec implementation of BLAKE2s is slower than MD5 on all architectures and compilers I have access to. However, Altivec is still faster than C in most cases.


Here are the benchmark numbers for the benchmark program I have access to. All measurements taken at -O3:

gcc119 (AIX, ppc64be):

  • BLAKE2b, C++ - 17.59 cpb
  • BLAKE2b, POWER8 - 5.55 cpb
  • BLAKE2s, C++ - 7.74 cpb
  • BLAKE2s, Altivec - 7.94 cpb

gcc112 (Linux, ppc64le):

  • BLAKE2b, C++ - 10.36 cpb
  • BLAKE2b, POWER8 - 8.29 cpb
  • BLAKE2s, C++ - 20.88 cpb
  • BLAKE2s, Altivec - 16.52 cpb

The CPU power setting is OnDemand, which introduces a lot of variability in the measurements. Measurements will swing by up to about 6 or 8 cpb. You usually need to run the benchmarks several times and take the next-to-lowest measurement (treating the lowest as an outlier). I'd like to get into Performance mode, but we need sudo access for a script like governor.sh.
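
The selection rule described above (run several times, discard the lowest, keep the next-to-lowest) could be sketched as follows. The helper next_to_lowest is hypothetical and not part of the benchmark program; it assumes the caller has already collected at least two cpb samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b) {
  double x = *(const double *)a;
  double y = *(const double *)b;
  return (x > y) - (x < y);
}

/* Hypothetical helper: sorts the samples in place and returns the
   next-to-lowest one, treating the single lowest as an outlier. */
static double next_to_lowest(double *cpb, size_t n) {
  assert(n >= 2);
  qsort(cpb, n, sizeof cpb[0], cmp_double);
  return cpb[1];
}
```

Sorting is more work than strictly needed for picking the second-smallest value, but with a handful of benchmark runs the cost is irrelevant and the sorted array is convenient for inspecting the spread.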

+3238 -0

2 comments

14 changed files

noloader

pr closed time in 17 days

push eventBLAKE2/BLAKE2

Jeffrey Walton

commit sha 40527ef47ff50e1b6ba58b40c14b4539f747e4a2

Add BLAKE2b POWER8 implementation

Samuel Neves

commit sha ce5bfc80f896da73e8967431ba1ef1f85cadfce7

rename power8 directory

Samuel Neves

commit sha 8c6526f992396561ec92498a0a7f835877e0e946

Merge pull request #65 from noloader/power8 Add BLAKE2b for POWER8

push time in 17 days

push eventnoloader/BLAKE2

Samuel Neves

commit sha ce5bfc80f896da73e8967431ba1ef1f85cadfce7

rename power8 directory

push time in 17 days

pull request commentBLAKE2/BLAKE2

Add BLAKE2b for POWER8

One thing I just thought of: will this work on anything prior to POWER8? If not, maybe it's misleading to have it in a ppc directory, instead of power8 or something?

noloader

comment created time in 17 days

issue commentBLAKE3-team/BLAKE3

very high ram usage on 133GB file?

This is just what Windows looks like with memory-mapped files. Mapped pages will factor into the working set size, even though they're not "real" memory allocated by the program:

The working set of a process is the set of pages in the virtual address space of the process that are currently resident in physical memory.

So this is not a memory leak or anything, it's just the OS doing as it's expected to. The same happens in the Unices.

divinity76

comment created time in 17 days

pull request commentBLAKE3-team/BLAKE3

C intrinsics: Use function attributes on GCC and Clang

The intrinsics code will work for 32-bit x86, whereas the assembly is x86_64-specific. It's also conceivable that future architectures will not like the choices made in the assembly, and that the intrinsics code, with compilers tuned for those architectures, will perform better. So it's somewhat useful to keep that code around.

k0001

comment created time in 17 days

issue commentBLAKE3-team/BLAKE3

Warnings during compilation with MAX_SIMD 1

Something like

#if defined(__GNUC__)
#define BLAKE3_ASSUME(cond) do { if(!(cond)) __builtin_unreachable(); } while(0)
#else
#define BLAKE3_ASSUME(cond) 
#endif

...

size_t num_cvs = blake3_compress_subtree_wide(...);
BLAKE3_ASSUME(num_cvs <= MAX_SIMD_DEGREE_OR_2);

will have no runtime overhead.

I've played around a bit with this issue, and what seems to confuse GCC is the recursion in blake3_compress_subtree_wide. Turning the recursion into a loop could conceivably unconfuse GCC, but I don't particularly feel like that's a worthy change.

xnox

comment created time in 17 days

issue commentP-H-C/phc-winner-argon2

Will argon2 detect instruction sets automatically?

There is presently no such mechanism; CPU dispatch decisions are all made at compile time.

guojize

comment created time in 17 days

pull request commentBLAKE2/BLAKE2

Add BLAKE2b for POWER8

You can try to do

git checkout power8
git reset --hard master
git cherry-pick add6578e006a252f0cd72d2f155c7c12529578d3
git push -f

to get the commit situation under control.

noloader

comment created time in 17 days

pull request commentBLAKE3-team/BLAKE3

Neon detection

Maybe I'm misunderstanding what you're saying, but the test program does not iterate through AVX etc. Say, for example, you have an ARM chip with NEON and SVE2 extensions: https://gcc.godbolt.org/z/J8i6MV

The program will iterate only through the feature combinations NONE, NEON|NONE, SVE2|NONE, SVE2|NEON|NONE, and ignore all other values (here NONE = 0 = portable code).

xnox

comment created time in 18 days

pull request commentBLAKE3-team/BLAKE3

Neon detection

I would prefer feature bits not to overlap, even across separate architectures, but that's your call.

c/blake3_c_rust_bindings/src/lib.rs needs a definition of neon_detected.

xnox

comment created time in 19 days

pull request commentBLAKE3-team/BLAKE3

Neon detection

On the C side, this seems to work OK on Linux. I made the following change to cycle through reference and NEON code when testing:

diff --git a/c/blake3_impl.h b/c/blake3_impl.h
index 98272d4..223fffd 100644
--- a/c/blake3_impl.h
+++ b/c/blake3_impl.h
@@ -73,8 +73,8 @@ enum cpu_feature {
   AVX512F = 1 << 5,
   AVX512VL = 1 << 6,
 #endif
-#if defined(IS_ARMHF)
-  NEON = 1 << 0,
+#if defined(IS_ARM)
+  NEON = 1 << 7,
 #endif
   /* ... */
   UNDEFINED = 1 << 30

I can't comment on the Rust side.

xnox

comment created time in 19 days

push eventBLAKE3-team/BLAKE3

Samuel Neves

commit sha a3ec6c1ccfe613cca886f6bff5feb0ec9c3710d9

enable CET on asm

Samuel Neves

commit sha f2005678f84a8222be69c54c3d5457c6c40e87d2

Merge pull request #96 from BLAKE3-team/cet Assembly: enable CET

push time in 19 days

PR merged BLAKE3-team/BLAKE3

Assembly: enable CET

Addresses #95.

With our current dispatcher, in both C and Rust, having endbr64 at the top of the functions is rather pointless, since these will never be the target of an indirect call. But since it is a glorified nop, we might as well.

+25 -1

8 comments

3 changed files

sneves

pr closed time in 19 days

pull request commentBLAKE3-team/BLAKE3

Assembly: enable CET

Most of the CET fixes have been backported to LLVM 10.x, except for <cet.h>. You can do

#if defined(__ELF__) && defined(__CET__) && __has_include(<cet.h>)
# include <cet.h>
#else
...
#endif

Goes in as suggested.

sneves

comment created time in 19 days

push eventBLAKE3-team/BLAKE3

Samuel Neves

commit sha a3ec6c1ccfe613cca886f6bff5feb0ec9c3710d9

enable CET on asm

push time in 19 days

push eventBLAKE3-team/BLAKE3

Samuel Neves

commit sha 881ffe398faef8c9b9dfb47738cca5e551d24b40

put endbr64 behind a macro

push time in 19 days

pull request commentBLAKE3-team/BLAKE3

Assembly: enable CET

The reason I did not just include the cet.h header is that it appears to be GCC-specific; building with Clang would fail, even though it supports -fcf-protection=full.

sneves

comment created time in 19 days

issue commentBLAKE3-team/BLAKE3

Warnings during compilation with MAX_SIMD 1

According to the error messages out_array is uint8_t[32]. But your PR #91 makes MAX_SIMD_DEGREE unconditionally 4 on ARM, so shouldn't MAX_SIMD_DEGREE_OR_2 be 4 instead of 2, and that array be uint8_t[64]?

Either way, this warning happens when MAX_SIMD_DEGREE is 1 or 2, and I believe it's a false positive. When MAX_SIMD_DEGREE is that low, the function is never called:

while (num_cvs > 2) {
  num_cvs =
      compress_parents_parallel(cv_array, num_cvs, key, flags, out_array);
  memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
}
xnox

comment created time in 20 days

PR opened BLAKE3-team/BLAKE3

Assembly: enable CET

Addresses #95.

With our current dispatcher, in both C and Rust, having endbr64 at the top of the functions is rather pointless, since these will never be the target of an indirect call. But since it is a glorified nop, we might as well.

+64 -0

0 comment

3 changed files

pr created time in 20 days

create branch BLAKE3-team/BLAKE3

branch : cet

created branch time in 20 days

pull request commentBLAKE2/BLAKE2

Add missing SSE includes

I remain confused. blake2b-load-sse41.h also uses SSE2 (and beyond) intrinsics, and there's no #include added. Why only those two files?

<emmintrin.h>, and others, are already included in blake2{s,b}.c, which is the only place these headers will be included from.

noloader

comment created time in 21 days

issue closedBLAKE2/BLAKE2

Any plan to make a new release?

Hi,

Is there any plan to make a new release, e.g. 20190724? Thanks.

closed time in 21 days

sunpoet

issue commentBLAKE2/BLAKE2

Any plan to make a new release?

Tagged current master as 20190724.

sunpoet

comment created time in 21 days

created tagBLAKE2/BLAKE2

tag 20190724

BLAKE2 official implementations

created time in 21 days

pull request commentBLAKE2/BLAKE2

Add missing SSE includes

I don't quite understand this one. There are many *-load-*.h files, but only this one includes the headers? Why?

Either way, these are internal headers, meant to be included from a place where the necessary headers are already included.

noloader

comment created time in 21 days

pull request commentBLAKE3-team/BLAKE3

c: Implement ifunc based dispatcher

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstdio>

#include <vector>
#include <algorithm>

#include <immintrin.h>

extern "C" {
	void blake3_compress_in_place(uint32_t cv[8], const uint8_t block[64], uint8_t block_len, uint64_t counter, uint8_t flags);
	void blake3_compress_in_place_sse41(uint32_t cv[8], const uint8_t block[64], uint8_t block_len, uint64_t counter, uint8_t flags);
}

namespace {

	static inline uint64_t rdtsc() {
	  _mm_lfence();
	#if defined(__clang__)
	  return __builtin_readcyclecounter();
	#else
	  uint32_t hi, lo = 0;
	  __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi)::"rcx");
	  return lo | (((uint64_t)hi) << 32);
	#endif
	}

	template<typename Function>
	static inline uint64_t bench(Function F, size_t const iters) {
		const uint8_t block[64] = {};
		uint32_t cv[8] = {};
		std::vector<uint64_t> cycles(iters + 1);
		for(auto& c : cycles) {
			c = rdtsc();
			F(cv, block, 64, 123, 0x99);
			 __asm__ __volatile__("" : : "m"(cv), "m"(block) : "memory");
		}
		for (size_t i = 0; i < iters; ++i) {
			cycles[i] = cycles[i + 1] - cycles[i];
		}
		cycles.pop_back();
		std::sort(begin(cycles), end(cycles));
		return cycles[iters / 2];
	}

}

int main() {
	const size_t kIters = 1UL << 24;
	printf("blake3_compress_in_place: %lu\n", bench([](auto...args) { blake3_compress_in_place(args...); }, kIters));
	printf("blake3_compress_in_place_sse41: %lu\n", bench([](auto...args) { blake3_compress_in_place_sse41(args...); }, kIters));
	return 0;
}
xnox

comment created time in 25 days

pull request commentBLAKE3-team/BLAKE3

c: Implement ifunc based dispatcher

In the ifunc dispatcher I used builtins for CPU features detection. I wonder if I can bring that over to the regular dispatcher, instead of using hand coded cpuid calls. Or at least use cpuid.h on Linux there.

We don't target GCC/Linux exclusively. Using builtins just means we'd have to maintain the builtin version and the other version.

I considered at the time using function pointers instead of regular branches for dispatching, as I had done before on BLAKE2. One, perhaps irrelevant, scenario where it matters is if you enable Spectre retpoline mitigations (e.g. -mretpoline, -mindirect-branch=thunk); then your function pointer overhead spikes by ~70 cycles, whereas the current dispatch method remains more or less constant. But this doesn't matter when using ifuncs, since they're implemented as relocations.

Anyway, back to overhead. I'm not sure there's a measurable benefit. I wrote a specific benchmark to measure the overhead of a single blake3_compress_in_place call, which would be the most affected by this. The results on a Skylake chip, taking the median of 2^24 measurements, are:

Current dispatching:

blake3_compress_in_place: 240 cycles/compression
blake3_compress_in_place_sse41: 234 cycles/compression

ifunc dispatching:

blake3_compress_in_place: 240 cycles/compression
blake3_compress_in_place_sse41: 234 cycles/compression

So performance-wise, it doesn't seem to make much of a difference. There's around 6 cycles of overhead relative to calling the function directly.

However, if we enable LTO, the detection code of blake3_compress_in_place can get inlined into the loop and the function call overhead is reduced:

blake3_compress_in_place: 238 cycles/compression

This is not the case for ifunc, which always remains at 240.

In any case, since this is mostly a very contained feature, I don't have much of a problem with including the ifunc dispatcher and letting users choose.
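
The two non-ifunc dispatch styles compared in this comment could look roughly like the sketch below. Everything in it is a hypothetical stand-in: the toy compress_* kernels, the EXAMPLE_SSE41 bit and detect_features_stub are not the BLAKE3 or BLAKE2 code, and the ifunc variant is left out because it depends on GCC and ELF.

```c
#include <stddef.h>
#include <stdint.h>

enum { EXAMPLE_SSE41 = 1 << 0 }; /* hypothetical feature bit */

/* Toy kernels standing in for the portable and SIMD implementations;
   both compute the same function, as real variants must. */
static uint32_t compress_portable(uint32_t x) { return x * 2654435761u; }
static uint32_t compress_sse41(uint32_t x) { return x * 2654435761u; }

/* Hypothetical detection stub; real code would query cpuid once and
   cache the result. */
static int detect_features_stub(void) { return EXAMPLE_SSE41; }

/* Style 1: test the feature bits with a regular branch on every call. */
static uint32_t compress_branch(uint32_t x) {
  if (detect_features_stub() & EXAMPLE_SSE41) {
    return compress_sse41(x);
  }
  return compress_portable(x);
}

/* Style 2: resolve a function pointer once, then call through it. */
typedef uint32_t (*compress_fn)(uint32_t);
static compress_fn compress_ptr = NULL;

static uint32_t compress_indirect(uint32_t x) {
  if (compress_ptr == NULL) {
    compress_ptr = (detect_features_stub() & EXAMPLE_SSE41)
                       ? compress_sse41
                       : compress_portable;
  }
  return compress_ptr(x);
}
```

The direct calls in style 1 are visible to the compiler, which is what allows inlining under LTO; the call through compress_ptr in style 2 is the kind of indirect branch that retpoline mitigations make more expensive.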

xnox

comment created time in 25 days

pull request commentBLAKE3-team/BLAKE3

Neon detection

The way we iterate through features in main.c will only cycle through available features:

const int mask = get_cpu_features();
int feature = 0;
do {
  ...
  feature = (feature - mask) & mask;
} while(feature != 0);

So, on ARM it will be only 0 and, if available, NEON.
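
The update step feature = (feature - mask) & mask visits every subset of the bits set in mask, in increasing order, and wraps back to 0 after the full mask. A minimal sketch that records the visited values is shown here; enumerate_feature_subsets is a hypothetical helper written for illustration, and it uses unsigned arithmetic so that the wrap-around is well defined.

```c
#include <stddef.h>

/* Hypothetical helper: stores each subset of `mask` in `out` (up to `cap`
   entries), starting with 0, and returns how many subsets were visited. */
static size_t enumerate_feature_subsets(unsigned mask, unsigned *out,
                                        size_t cap) {
  size_t n = 0;
  unsigned feature = 0;
  do {
    if (n < cap) {
      out[n] = feature;
    }
    n++;
    /* Next subset of mask; becomes 0 again once feature == mask. */
    feature = (feature - mask) & mask;
  } while (feature != 0);
  return n;
}
```

For a mask with bits 0 and 2 set (value 5), the visited values are 0, 1, 4 and 5, so bit combinations outside the mask are never tested.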

xnox

comment created time in a month

issue commentBLAKE3-team/BLAKE3

Enhancement: b3sum add -c --check flag for verification

The MSVC C runtime is already perfectly able to handle globbing on its own, however this is not enabled by default. By linking against setargv.obj, or by defining

extern "C" _crt_argv_mode __CRTDECL _get_startup_argv_mode()
{
    return _crt_argv_expanded_arguments;
}

oneself, globbing becomes transparently enabled.

So with some special linker incantations, it should be possible to enable globbing on Windows without any extra code.

tERyceNzAchE

comment created time in 2 months

pull request commentBLAKE3-team/BLAKE3

C intrinsics: Use function attributes on GCC and Clang

As far as I could tell, these are supported in all reasonable versions of gcc and clang.

Perhaps a slightly less noisy version would be something like

#include <immintrin.h>

#if defined(__clang__)
#pragma clang attribute push (__attribute__((target("avx2"))), apply_to=function)
#elif defined(__GNUC__)
#pragma GCC target("avx2")
#endif

... code here ...

#if defined(__clang__)
#pragma clang attribute pop
#endif

But this loses support for Clang <= 4.x; more specifically, versions 3.8.x, 3.9.x, and 4.x. Not sure if this is an important case to care about.

k0001

comment created time in 2 months

issue commentBLAKE3-team/BLAKE3

Enhancement: b3sum add -c --check flag for verification

To be clear, * indicates binary mode.

tERyceNzAchE

comment created time in 2 months

issue commentBLAKE3-team/BLAKE3

Enhancement: b3sum add -c --check flag for verification

putchar (file_is_binary ? '*' : ' ');

This indicates whether the file was opened with fopen(name, "rb") or fopen(name, "r").

tERyceNzAchE

comment created time in 2 months

issue commentBLAKE3-team/BLAKE3

Enhancement: b3sum add -c --check flag for verification

Backslashes are escaped on Windows:

C:\Users\John>md5sum.exe \\.\C:\Users\John\Desktop\b3sum_windows_x64_bin.exe
\1d54334dd99612a139fbb51736290e85 *\\\\.\\C:\\Users\\John\\Desktop\\b3sum_windows_x64_bin.exe
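
The escaping visible in this output can be sketched in C. The function escape_filename is hypothetical and written only for illustration; it is not b3sum or coreutils code. It doubles each backslash, as in the output above, and also rewrites a newline as \n, which is an assumption based on how GNU md5sum is documented to behave.

```c
#include <stddef.h>

/* Hypothetical helper: copies `name` into `out` (capacity `cap`),
   escaping backslashes and newlines. Returns 1 if anything was escaped
   (the caller would then start the output line with a backslash),
   0 if the name was copied unchanged, and -1 if `out` is too small. */
static int escape_filename(const char *name, char *out, size_t cap) {
  size_t j = 0;
  int escaped = 0;
  for (size_t i = 0; name[i] != '\0'; i++) {
    if (j + 3 > cap) {
      return -1; /* need room for two characters plus the terminator */
    }
    if (name[i] == '\\') {
      out[j++] = '\\';
      out[j++] = '\\';
      escaped = 1;
    } else if (name[i] == '\n') {
      out[j++] = '\\';
      out[j++] = 'n';
      escaped = 1;
    } else {
      out[j++] = name[i];
    }
  }
  if (j + 1 > cap) {
    return -1;
  }
  out[j] = '\0';
  return escaped;
}
```

A verifier reading such a line has to undo the escaping only when the line starts with a backslash, which keeps ordinary file names unchanged.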
tERyceNzAchE

comment created time in 2 months

pull request commentBLAKE2/libb2

Fix build with distcc

Thanks. I do remember seeing this warning repeatedly, but didn't know what the issue was.

DarthGandalf

comment created time in 3 months

push eventBLAKE2/libb2

Alexey Sokolov

commit sha 0a706db5d73777c605fbfdcdbe72892a3dc36c45

Fix build with distcc It also fixes a warning when building without distcc: armv7a-unknown-linux-gnueabihf-gcc: warning: ../src/: linker input file unused because linking not done https://bugs.gentoo.org/704044

Samuel Neves

commit sha 0f7c898301265c8e29db306f62701918dd88db2f

Merge pull request #31 from DarthGandalf/patch-1 Fix build with distcc

push time in 3 months

PR merged BLAKE2/libb2

Fix build with distcc

It also fixes a warning when building without distcc:

armv7a-unknown-linux-gnueabihf-gcc: warning: ../src/: linker input file unused because linking not done

https://bugs.gentoo.org/704044

+1 -2

0 comment

1 changed file

DarthGandalf

pr closed time in 3 months
