James Macdonald (jamyspex), Golspie, UK

jamyspex/4od-ad-block

Google Chrome extension to remove advertisements from the 4oD streaming service.

jamyspex/ale

Check syntax in Vim asynchronously and fix files, with Language Server Protocol (LSP) support.

jamyspex/MPI-LES

Repository for MPI LES.

jamyspex/OctoPack

Creates Octopus-compatible NuGet packages.

push event Lombiq/Hastlayer-SDK

Dávid El-Saig · commit a407d6a75f665ad8f7ec0f662d913de2e9513f92 · Fix CA1305.
Dávid El-Saig · commit 6d7eefd7c2bd37f757fb148674607fefc44f07ce · Fix CA1307.

pushed 3 hours ago

push event Lombiq/Hastlayer-SDK

Dávid El-Saig · commit 6278b5d03fbc00fca4aa858e42844d6163359309 · Fix SA1507.
Dávid El-Saig · commit a31ed64fdfd48f5970d22378d0b20cda1a986ff5 · Fix SA1513.
Dávid El-Saig · commit 2742a2fe10cfaaf8d0f8723adf631992ac10bc7a · Fix SA1623.
Dávid El-Saig · commit ac12604c4a94e32ca9f90330ab28337754c1c01f · Fix SA1629.
Dávid El-Saig · commit f780622b7270944eb96c0dbd83febb252ed8bcb6 · Fix SA1629.

pushed 3 hours ago

push event Lombiq/Hastlayer-SDK

Dávid El-Saig · commit 0da70eac695f6b3f6bc041c53daa60ecf38ea882 · Fix CA1018.
Dávid El-Saig · commit 84820868383d0fd5562d1f709309b42733a479f4 · Fix CA1024.
Dávid El-Saig · commit feeacb01b8d78adc7c9cef9d94d4fb57d3f42fa7 · Disable CA1028.
Dávid El-Saig · commit 547b16fd67a1a4007024105f8b07893b66a32b45 · Fix exceptions.
Dávid El-Saig · commit 238c60334a9569fe8f908b05edc9f334795ede6e · Use var instead of explicit type.
Dávid El-Saig · commit e1f26d96f91467168bd86378f97994a88c40100f · Fix CA1052.
Dávid El-Saig · commit 83145148f7ded80663ecc89792b1d97ff2a7d250 · Fix CA1063.
Dávid El-Saig · commit c8618851b27f7fe964dc524ca28ae0679a599491 · Fix CA1065.
Dávid El-Saig · commit df4d4418a8e29790babc141ec4dd46c43f029bc7 · Fix SA1507.
Dávid El-Saig · commit 633cae27460f45deac6c0c6a47c241162ce28b4d · Fix various rare (1-2 instance) warnings.

pushed 5 hours ago

push event Lombiq/Hastlayer-SDK

Dávid El-Saig · commit 836bd823d3bcfe5e21faa80d9818ab9650b6889c · Fix CA1008.

pushed 7 hours ago

create branch Lombiq/Hastlayer-SDK

branch: issue/HAST-164

branch created 7 hours ago

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here you can find some measurements of execution times of various algorithms on 
 ### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).
+
+| Device     | Algorithm                         | Speed advantage | Power advantage | Parallelism | CPU       | CPU power | FPGA utilization | Net FPGA | Total FPGA | FPGA power | FPGA on-chip power |
+|------------|-----------------------------------|-----------------|-----------------|-------------|-----------|-----------|------------------|----------|------------|------------|--------------------|
+| Alveo U200 | ImageContrastModifier<sup>1</sup> | 735%            | 3047%           | 150         | 543 ms    | 49 Ws     | 27.23%           | 12 ms    | 65 ms      | 1.55 Ws    | 23.89 W            |
+| Alveo U200 | ImageContrastModifier<sup>2</sup> | 3448%           | 13266%          | 150         | 198172 ms | 17835 Ws  | 27.23%           | 5340 ms  | 5586 ms    | 133.44 Ws  | 23.89 W            |
+| Alveo U200 | ParallelAlgorithm                 | 171%            | 1464%           | 300         | 379 ms    | 34 Ws     | 15.56%           | 110 ms   | 140 ms     | 2.18 Ws    | 15.58 W            |
+| Alveo U200 | MonteCarloPiEstimator             | 203%            | 1333%           | 230         | 203 ms    | 18 Ws     | 18.57%           | 17 ms    | 67 ms      | 1.28 Ws    | 19.04 W            |
+| Alveo U250 | ImageContrastModifier<sup>1</sup> | 1503%           | 5621%           | 150         | 529 ms    | 48 Ws     | 18.29%           | 13 ms    | 33 ms      | 0.83 Ws    | 25.22 W            |
+| Alveo U250 | ImageContrastModifier<sup>2</sup> | 3268%           | 11921%          | 150         | 193158 ms | 17384 Ws  | 18.29%           | 5535 ms  | 5735 ms    | 144.61 Ws  | 25.22 W            |
+| Alveo U250 | ParallelAlgorithm                 | 357%            | 2437%           | 300         | 498 ms    | 45 Ws     | 10.30%           | 101 ms   | 109 ms     | 1.77 Ws    | 16.21 W            |
+| Alveo U250 | MonteCarloPiEstimator             | 369%            | 2022%           | 230         | 197 ms    | 18 Ws     | 12.39%           | 21 ms    | 42 ms      | 0.84 Ws    | 19.89 W            |
+| Alveo U280 | ImageContrastModifier<sup>1</sup> | 1591%           | 6505%           | 150         | 541 ms    | 49 Ws     | 21.44%           | 12 ms    | 32 ms      | 0.74 Ws    | 23.04 W            |
+| Alveo U280 | ImageContrastModifier<sup>3</sup> | 3414%           | 13629%          | 150         | 17359 ms  | 1562 Ws   | 21.44%           | 459 ms   | 494 ms     | 11.38 Ws   | 23.04 W            |
+| Alveo U280 | ParallelAlgorithm                 | 226%            | 1858%           | 300         | 362 ms    | 33 Ws     | 10.86%           | 102 ms   | 111 ms     | 1.66 Ws    | 14.99 W            |
+| Alveo U280 | MonteCarloPiEstimator             | 387%            | 2397%           | 230         | 185 ms    | 17 Ws     | 13.63%           | 16 ms    | 38 ms      | 0.67 Ws    | 17.55 W            |
+| Alveo U50  | ImageContrastModifier<sup>1</sup> | 1324%           | 6359%           | 150         | 470 ms    | 42 Ws     | 32.09%           | 12 ms    | 33 ms      | 0.65 Ws    | 19.85 W            |
+| Alveo U50  | ImageContrastModifier<sup>4</sup> | 3462%           | 16052%          | 150         | 17167 ms  | 1545 Ws   | 32.09%           | 450 ms   | 482 ms     | 9.57 Ws    | 19.85 W            |
+| Alveo U50  | ParallelAlgorithm                 | 258%            | 2653%           | 300         | 379 ms    | 34 Ws     | 16.22%           | 104 ms   | 106 ms     | 1.24 Ws    | 11.69 W            |
+| Alveo U50  | MonteCarloPiEstimator             | 348%            | 2693%           | 230         | 197 ms    | 18 Ws     | 20.37%           | 18 ms    | 44 ms      | 0.63 Ws    | 14.43 W            |
+
+1. Using the default 0.2MP image `fpga.jpg`.
+2. Using the larger [73.2MP image](https://photographingspace.com/wp-content/uploads/2019/10/2019JulyLunarEclipse-Moon0655-CorySchmitz-PI2_wm-web.jpg).
+3. Using the scaled down [6.4MP image](https://photographingspace.com/wp-content/uploads/2019/10/2019JulyLunarEclipse-Moon0655-CorySchmitz-PI2_wm-web50pct-square-scaled.jpg) because the testing binary was built for the High Bandwidth Memory. Currently only one HBM slot is supported, meaning that the available memory without disabling HBM is 256MB.

OK. In any case, I've uploaded it to the wiki as well.

DAud-IcI

comment created 4 days ago
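The "speed advantage" and "power advantage" columns in the reviewed table can be cross-checked against its raw time and energy columns. A minimal sketch, assuming advantage = (baseline / accelerated − 1) expressed in percent, which matches the definition in the Benchmarks notes (a 100% speed advantage means half the execution time):

```python
# Cross-check of the "advantage" percentages in the benchmark table.
# Assumed definition: advantage = (baseline / accelerated - 1) * 100,
# so 100% advantage means the accelerated run took half the time.

def advantage_percent(baseline: float, accelerated: float) -> float:
    """Percentage advantage of the accelerated figure over the baseline."""
    return (baseline / accelerated - 1) * 100

# Alveo U200 / ImageContrastModifier row: CPU 543 ms vs. 65 ms total FPGA.
print(round(advantage_percent(543, 65)))   # 735, matching the table's 735%

# Same row for energy: 49 Ws CPU vs. 1.55 Ws FPGA.
print(round(advantage_percent(49, 1.55)))  # 3061; table says 3047% (its inputs are rounded)
```

The small discrepancy in the power figure comes from the table rounding the Ws columns before the ratio is taken.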

push event Lombiq/Hastlayer-SDK

Dávid El-Saig · commit 9132300eff4a3df192b6a6ce2e75f0c3f01af917 · Benchmarks.
Dávid El-Saig · commit bcfa1acc46c9a071934eed61e48168a6d4995c8c · It's enough when the bin/hash.xclbin and matching info files exist.
Dávid El-Saig · commit ac620bfeb6a8c6c3d21276d79aa723997ae5f2be · Added notes on cross compilation.
Dávid El-Saig · commit 93999e5b313ee496691bd5a7c4fdadaaea6bce62 · Clarification.
Dávid El-Saig · commit 4b3b35a27fce899cb841524deab5f402f1e8f259 · Fix incorrect report retrieval.
Dávid El-Saig · commit 2bfb883d3974d0ed03fc1b267842f6a7dd897be4 · Added warning about memory.
Dávid El-Saig · commit 3f5dc69270588a17ee264aae4514b34d83bd32e1 · Added some benchmarks.
Dávid El-Saig · commit e216aeb41b8f7f71a0e60d782bb3f37e2fa56b08 · U280 speed stats.
Dávid El-Saig · commit 8f1bbdd28225ed300142d211fe086cde0c1653f6 · U250 remaining speed stats.
Dávid El-Saig · commit 7fe7f2de09d96246736173a27611c26828564805 · More documentation.
Dávid El-Saig · commit 5d57f52a764ed67405f6ea1058b3d353659bfebd · Added the ability to disable HBM.
Dávid El-Saig · commit 0f85d637544c7e22d1186935d7266b5cb30688f4 · Updated U200 speeds.
Dávid El-Saig · commit 7af612e4b355aba691c34756afc059ac667d0cc0 · Updated FPGA on-chip power.
Dávid El-Saig · commit a5c8d652586ed754d85ab24a9b7051ceae7aa581 · Filled out values for all except U50 ParallelAlgorithm.
Dávid El-Saig · commit 585d62123c3d7fc1e7d3f2652346d96587d6d0e5 · Added note on VS Code extension.
Dávid El-Saig · commit ebbb06ccf0fc62da283b3c67619eabed034b1c3a · Filled in the last benchmark row and added some decimals to the power columns.
Dávid El-Saig · commit 263d862e733e68dcfb17a580069f8333dd6e4a78 · Add Frequency and DataSize constants for manifest readability.
Dávid El-Saig · commit 0c78faead10923e9fec1212e38121315f77720ea · Also added binary prefix data sizes.
Dávid El-Saig · commit 239f84918f115e71c21ef168003ed01670a1f22a · Merge remote-tracking branch 'origin/dev' into issue/HAST-159
Dávid El-Saig · commit 9859e3a2df90645f6fca1bcb18ff191f72c57423 · Remove unused property.

pushed 4 days ago

PR merged Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

HAST-159

+385 -71

3 comments

26 changed files

DAud-IcI

PR closed 4 days ago

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here you can find some measurements of execution times of various algorithms on 
 ### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).
+
+| Device     | Algorithm                         | Speed advantage | Power advantage | Parallelism | CPU       | CPU power | FPGA utilization | Net FPGA | Total FPGA | FPGA power | FPGA on-chip power |
+|------------|-----------------------------------|-----------------|-----------------|-------------|-----------|-----------|------------------|----------|------------|------------|--------------------|
+| Alveo U200 | ImageContrastModifier<sup>1</sup> | 735%            | 3047%           | 150         | 543 ms    | 49 Ws     | 27.23%           | 12 ms    | 65 ms      | 1.55 Ws    | 23.89 W            |
+| Alveo U200 | ImageContrastModifier<sup>2</sup> | 3448%           | 13266%          | 150         | 198172 ms | 17835 Ws  | 27.23%           | 5340 ms  | 5586 ms    | 133.44 Ws  | 23.89 W            |
+| Alveo U200 | ParallelAlgorithm                 | 171%            | 1464%           | 300         | 379 ms    | 34 Ws     | 15.56%           | 110 ms   | 140 ms     | 2.18 Ws    | 15.58 W            |
+| Alveo U200 | MonteCarloPiEstimator             | 203%            | 1333%           | 230         | 203 ms    | 18 Ws     | 18.57%           | 17 ms    | 67 ms      | 1.28 Ws    | 19.04 W            |
+| Alveo U250 | ImageContrastModifier<sup>1</sup> | 1503%           | 5621%           | 150         | 529 ms    | 48 Ws     | 18.29%           | 13 ms    | 33 ms      | 0.83 Ws    | 25.22 W            |
+| Alveo U250 | ImageContrastModifier<sup>2</sup> | 3268%           | 11921%          | 150         | 193158 ms | 17384 Ws  | 18.29%           | 5535 ms  | 5735 ms    | 144.61 Ws  | 25.22 W            |
+| Alveo U250 | ParallelAlgorithm                 | 357%            | 2437%           | 300         | 498 ms    | 45 Ws     | 10.30%           | 101 ms   | 109 ms     | 1.77 Ws    | 16.21 W            |
+| Alveo U250 | MonteCarloPiEstimator             | 369%            | 2022%           | 230         | 197 ms    | 18 Ws     | 12.39%           | 21 ms    | 42 ms      | 0.84 Ws    | 19.89 W            |
+| Alveo U280 | ImageContrastModifier<sup>1</sup> | 1591%           | 6505%           | 150         | 541 ms    | 49 Ws     | 21.44%           | 12 ms    | 32 ms      | 0.74 Ws    | 23.04 W            |
+| Alveo U280 | ImageContrastModifier<sup>3</sup> | 3414%           | 13629%          | 150         | 17359 ms  | 1562 Ws   | 21.44%           | 459 ms   | 494 ms     | 11.38 Ws   | 23.04 W            |
+| Alveo U280 | ParallelAlgorithm                 | 226%            | 1858%           | 300         | 362 ms    | 33 Ws     | 10.86%           | 102 ms   | 111 ms     | 1.66 Ws    | 14.99 W            |
+| Alveo U280 | MonteCarloPiEstimator             | 387%            | 2397%           | 230         | 185 ms    | 17 Ws     | 13.63%           | 16 ms    | 38 ms      | 0.67 Ws    | 17.55 W            |
+| Alveo U50  | ImageContrastModifier<sup>1</sup> | 1324%           | 6359%           | 150         | 470 ms    | 42 Ws     | 32.09%           | 12 ms    | 33 ms      | 0.65 Ws    | 19.85 W            |
+| Alveo U50  | ImageContrastModifier<sup>4</sup> | 3462%           | 16052%          | 150         | 17167 ms  | 1545 Ws   | 32.09%           | 450 ms   | 482 ms     | 9.57 Ws    | 19.85 W            |
+| Alveo U50  | ParallelAlgorithm                 | 258%            | 2653%           | 300         | 379 ms    | 34 Ws     | 16.22%           | 104 ms   | 106 ms     | 1.24 Ws    | 11.69 W            |
+| Alveo U50  | MonteCarloPiEstimator             | 348%            | 2693%           | 230         | 197 ms    | 18 Ws     | 20.37%           | 18 ms    | 44 ms      | 0.63 Ws    | 14.43 W            |
+
+1. Using the default 0.2MP image `fpga.jpg`.
+2. Using the larger [73.2MP image](https://photographingspace.com/wp-content/uploads/2019/10/2019JulyLunarEclipse-Moon0655-CorySchmitz-PI2_wm-web.jpg).
+3. Using the scaled down [6.4MP image](https://photographingspace.com/wp-content/uploads/2019/10/2019JulyLunarEclipse-Moon0655-CorySchmitz-PI2_wm-web50pct-square-scaled.jpg) because the testing binary was built for the High Bandwidth Memory. Currently only one HBM slot is supported, meaning that the available memory without disabling HBM is 256MB.

Let me know once they've gotten back to you, and please push it to dev if they allow it.

DAud-IcI

comment created 4 days ago

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here you can find some measurements of execution times of various algorithms on 
 ### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).

OK, let's leave it for now.

DAud-IcI

comment created 4 days ago

push event Lombiq/Hastlayer-SDK

Zoltán Lehóczky · commit 8c56839cf62bdabcf3fe3deac0af0cc5b371d9ea · Adding progress message when no Vitis build is done

pushed 4 days ago

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here you can find some measurements of execution times of various algorithms on 
 ### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).

At least that's my understanding. Then again, there may be some basic thing here we don't know about, like this 2019.2 vs. 2020.1 build issue that just came to light. :/

DAud-IcI

comment created 4 days ago

push event Lombiq/Hastlayer-SDK

Zoltán Lehóczky · commit 776d294be6b2c2c47d085cdaf832677d6a011d5d · Adding note on the Vivado version used for Nexys tests

pushed 4 days ago

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here are some basic performance benchmarks on how Hastlayer-accelerated code compares to standard .NET. Since with FPGAs you're not running a program on a processor like a CPU or GPU but rather you create a processor out of your algorithm direct comparisons are hard. Nevertheless, here we tried to compare FPGAs and host PCs (CPUs) with roughly on the same level (e.g. comparing a mid-tier CPU to a mid-tier FPGA). All the algorithms are samples in the Hastlayer solution and available for you to check.
 
 
-## Notes on the hardware used
-
-- "Vitis": [Xilinx Vitis Unified Software Platform](https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html) cards were used (eg. [Alveo U280 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u280.html)).
-- "Catapult": [Microsoft Project Catapult](https://www.microsoft.com/en-us/research/project/project-catapult/) servers used via the [Project Catapult Academic Program](https://www.microsoft.com/en-us/research/academic-program/project-catapult-academic-program/). These contain the following hardware:
-    - FPGA: Mt Granite card with an Altera Stratix V 5SGSMD5H2F35 FPGA and two channels of 4 GB DDR3 RAM, connected to the host via PCIe Gen3 x8. Main clock is 150 Mhz, power consumption is at most 29 W (source: "[A Cloud-Scale Acceleration Architecture](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf)").
-    - Host PC: 2 x Intel Xeon E5-2450 CPUs with 16 physical, 32 logical cores each, with a base clock of 2.1 GHz. Power consumption is around 95 W under load (based on [the processor's TDP](https://ark.intel.com/content/www/us/en/ark/products/64611/intel-xeon-processor-e5-2450-20m-cache-2-10-ghz-8-00-gt-s-intel-qpi.html); this is just a rough number and power draw is likely larger when the CPU increases its clock speed under load)
-- "i7 CPU": Intel Core i7-960 CPU with 4 physical, 8 logical cores and a base clock of 3.2 Ghz. Power consumption is around 130 W under load (based on [the processor's TDP](https://ark.intel.com/content/www/us/en/ark/products/37151/intel-core-i7-960-processor-8m-cache-3-20-ghz-4-80-gt-s-intel-qpi.html)).
-- "Nexys": [Nexys A7-100T FPGA board](https://store.digilentinc.com/nexys-a7-fpga-trainer-board-recommended-for-ece-curriculum/) with a Xilinx XC7A100T-1CSG324C FPGA of the Artix-7 family, with 110 MB of user-accessible DDR2 RAM. Main clock is 100 Mhz, power consumption is at most about 2.5 W (corresponding to the maximal power draw via a USB 2.0 port). The communication channel used was the serial one: Virtual serial port via USB 2.0 with a baud rate of 230400 b/s.
-
-
-## Measurements
+## Notes on measurements
 
 Here you can find some measurements of execution times of various algorithms on different platforms. Note:
 
 - Measurements were made with binaries built in Release mode.
+- Measurements were always taken from the second or third run of the application to disregard initialization time. 
 - Figures are rounded to the nearest integer.
 - Speed and power advantage means the execution time and power consumption advantage of the Hastlayer-accelerated FPGA implementation. E.g. a 100% speed advantage means that the Hastlayer-accelerated implementation took half the time to finish than the original CPU one.
 - Degree of parallelism indicates the level of parallelization in the algorithm (typically the number of concurrent `Task`s). For details on these check out the source of the respective algorithm (these are all classes in the Hastlayer SDK's solution). Note that the degree of parallelism, as indicated, differs between platforms. On every platform the highest possible parallelism was used (so CPUs were under full load and thus peak power consumption can be reasonably assumed).
 - For CPU execution always the lowest achieved number is used to disregard noise. So this is an optimistic approach and in real life the CPU executions will most possibly be slower.
-- Power consumption is an approximation based on hardware details above. For PCs it only contains the power consumption of the CPU(s). For FPGA measurements the "total" time is used (though presumably when just communication is running the power consumption is much lower than when computations are being executed).
+- Power consumption is an approximation based on hardware details. For PCs it only contains the power consumption of the CPU(s). For FPGA measurements the "total" time is used (though presumably when just communication is running the power consumption is much lower than when computations are being executed).
 - FPGA resource utilization figures are based on the "main" resource's utilization with all other resource types assumed to be below 100%. For Xilinx FPGAs the main resource type is LUT, for Intel (Altera) ones ALM.
 - For FPGA measurements "total" means the total execution time, including the communication latency of the FPGA; since this varies because of the host PC's load the lowest achieved number is used. "Net" means just the execution of the algorithm itself on the FPGA, not including the time it took to send data to and receive from the device; FPGA execution time is deterministic and doesn't vary significantly. With faster communication channels "total" can be closer to "net". If the input and output data is small then the two measurements will practically be the same.
 
-### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+## Vitis
 
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280/250/200/50) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) instance.
 
-| Algorithm             | Speed advantage | Power advantage |   Parallelism  |   CPU  | CPU power | FPGA utilization | Net FPGA | Total FPGA | FPGA power |
-|:----------------------|:---------------:|:---------------:|:--------------:|:------:|:---------:|:----------------:|:--------:|:----------:|:----------:|
-| ImageContrastModifier |       568%      |       ???%      |        25      | 568 ms |   ?? Ws   |        ??%       |   28 ms  |    85 ms   |   ?? Ws    |
-| ImageContrastModifier |       620%      |       ???%      | 150<sup>1</sup>| 568 ms |   ?? Ws   |        ??%       |   24 ms  |    79 ms   |   ?? Ws    |
+### Details
+
+- FPGA: The following [Xilinx Vitis Unified Software Platform](https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html) cards were used:
+  - [Alveo U280 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u280.html), PCI Express® Gen3 x16, 225 W Maximum Total Power
+  - [Alveo U250 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u250.html), PCI Express® Gen3 x16, 225 W Maximum Total Power
+  - [Alveo U200 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u200.html), PCI Express® Gen3 x16, 225 W Maximum Total Power
+  - [Alveo U50 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u50.html), PCI Express® Gen3 x16, 75 W Maximum Total Power
+- Host: A [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance with 16 x Intel Xeon E5-2640 v3 CPUs with 8 physical, 16 logical cores each, with a base clock of 2.6 GHz. Power consumption is around 90 W under load (based on the processor's TDP, [see here](https://ark.intel.com/content/www/us/en/ark/products/83359/intel-xeon-processor-e5-2640-v3-20m-cache-2-60-ghz.html); the power draw is likely larger when the CPU increases its clock speed under load).
+- Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](Attachments/BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).
+
+### Measurements
+
+| Device     | Algorithm                         | Speed advantage | Power advantage | Parallelism | CPU       | CPU power | FPGA utilization | Net FPGA | Total FPGA | FPGA power | FPGA on-chip power |
+|------------|-----------------------------------|-----------------|-----------------|-------------|-----------|-----------|------------------|----------|------------|------------|--------------------|
+| Alveo U280 | ImageContrastModifier<sup>1</sup> | 1591%           | 6505%           | 150         | 541 ms    | 49 Ws     | 21.44%           | 12 ms    | 32 ms      | 0.74 Ws    | 23.04 W            |
+| Alveo U280 | ImageContrastModifier<sup>3</sup> | 3414%           | 13629%          | 150         | 17359 ms  | 1562 Ws   | 21.44%           | 459 ms   | 494 ms     | 11.38 Ws   | 23.04 W            |
+| Alveo U280 | ParallelAlgorithm                 | 226%            | 1858%           | 300         | 362 ms    | 33 Ws     | 10.86%           | 102 ms   | 111 ms     | 1.66 Ws    | 14.99 W            |
+| Alveo U280 | MonteCarloPiEstimator             | 387%            | 2397%           | 230         | 185 ms    | 17 Ws     | 13.63%           | 16 ms    | 38 ms      | 0.67 Ws    | 17.55 W            |
+| Alveo U250 | ImageContrastModifier<sup>1</sup> | 1503%           | 5621%           | 150         | 529 ms    | 48 Ws     | 18.29%           | 13 ms    | 33 ms      | 0.83 Ws    | 25.22 W            |
+| Alveo U250 | ImageContrastModifier<sup>2</sup> | 3268%           | 11921%          | 150         | 193158 ms | 17384 Ws  | 18.29%           | 5535 ms  | 5735 ms    | 144.61 Ws  | 25.22 W            |
+| Alveo U250 | ParallelAlgorithm                 | 357%            | 2437%           | 300         | 498 ms    | 45 Ws     | 10.30%           | 101 ms   | 109 ms     | 1.77 Ws    | 16.21 W            |
+| Alveo U250 | MonteCarloPiEstimator             | 369%            | 2022%           | 230         | 197 ms    | 18 Ws     | 12.39%           | 21 ms    | 42 ms      | 0.84 Ws    | 19.89 W            |
+| Alveo U200 | ImageContrastModifier<sup>1</sup> | 735%            | 3047%           | 150         | 543 ms    | 49 Ws     | 27.23%           | 12 ms    | 65 ms      | 1.55 Ws    | 23.89 W            |
+| Alveo U200 | ImageContrastModifier<sup>2</sup> | 3448%           | 13266%          | 150         | 198172 ms | 17835 Ws  | 27.23%           | 5340 ms  | 5586 ms    | 133.44 Ws  | 23.89 W            |
+| Alveo U200 | ParallelAlgorithm                 | 171%            | 1464%           | 300         | 379 ms    | 34 Ws     | 15.56%           | 110 ms   | 140 ms     | 2.18 Ws    | 15.58 W            |
+| Alveo U200 | MonteCarloPiEstimator             | 203%            | 1333%           | 230         | 203 ms    | 18 Ws     | 18.57%           | 17 ms    | 67 ms      | 1.28 Ws    | 19.04 W            |
+| Alveo U50  | ImageContrastModifier<sup>1</sup> | 1324%           | 6359%           | 150         | 470 ms    | 42 Ws     | 32.09%           | 12 ms    | 33 ms      | 0.65 Ws    | 19.85 W            |
+| Alveo U50  | ImageContrastModifier<sup>3</sup> | 3462%           | 16052%          | 150         | 17167 ms  | 1545 Ws   | 32.09%           | 450 ms   | 482 ms     | 9.57 Ws    | 19.85 W            |
+| Alveo U50  | ParallelAlgorithm                 | 258%            | 2653%           | 300         | 379 ms    | 34 Ws     | 16.22%           | 104 ms   | 106 ms     | 1.24 Ws    | 11.69 W            |
+| Alveo U50  | MonteCarloPiEstimator             | 348%            | 2693%           | 230         | 197 ms    | 18 Ws     | 20.37%           | 18 ms    | 44 ms      | 0.63 Ws    | 14.43 W            |

Ha, that's funny. OK, it'll do for now; we can try again later with 2020.2.

DAud-IcI

comment created 4 days ago
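The Ws (watt-second) energy columns in the hunk above follow directly from power × time. A quick sketch, assuming CPU energy = TDP × CPU time and FPGA energy = on-chip power × total FPGA time; this reproduces the table's rounded figures, though the exact methodology is my reading of the notes rather than something stated verbatim:

```python
# Sketch: reproduce the energy (Ws) columns of the benchmark table,
# assuming energy = power draw (W) * execution time (s).

def energy_ws(power_w: float, time_ms: float) -> float:
    """Energy in watt-seconds for a given power draw and duration."""
    return power_w * time_ms / 1000

# Alveo U200 / ImageContrastModifier row of the table:
print(round(energy_ws(90, 543)))       # 49 Ws: one ~90 W TDP Xeon for 543 ms
print(round(energy_ws(23.89, 65), 2))  # 1.55 Ws: 23.89 W on-chip power for 65 ms
```

Both results match the "CPU power" and "FPGA power" cells for that row, supporting the reading that only a single CPU at TDP is assumed and that the FPGA figure uses the "total" time.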

push event Lombiq/Hastlayer-SDK

Dávid El-Saig · commit 9ca81fe8324368dbe82eaf819b18a62f6a1a64d9 · Add note on 2019.2 compiler platform version.

pushed 4 days ago

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here are some basic performance benchmarks on how Hastlayer-accelerated code compares to standard .NET. Since with FPGAs you're not running a program on a processor like a CPU or GPU but rather you create a processor out of your algorithm direct comparisons are hard. Nevertheless, here we tried to compare FPGAs and host PCs (CPUs) with roughly on the same level (e.g. comparing a mid-tier CPU to a mid-tier FPGA). All the algorithms are samples in the Hastlayer solution and available for you to check.
 
 
-## Notes on the hardware used
-
-- "Vitis": [Xilinx Vitis Unified Software Platform](https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html) cards were used (eg. [Alveo U280 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u280.html)).
-- "Catapult": [Microsoft Project Catapult](https://www.microsoft.com/en-us/research/project/project-catapult/) servers used via the [Project Catapult Academic Program](https://www.microsoft.com/en-us/research/academic-program/project-catapult-academic-program/). These contain the following hardware:
-    - FPGA: Mt Granite card with an Altera Stratix V 5SGSMD5H2F35 FPGA and two channels of 4 GB DDR3 RAM, connected to the host via PCIe Gen3 x8. Main clock is 150 Mhz, power consumption is at most 29 W (source: "[A Cloud-Scale Acceleration Architecture](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf)").
-    - Host PC: 2 x Intel Xeon E5-2450 CPUs with 16 physical, 32 logical cores each, with a base clock of 2.1 GHz. Power consumption is around 95 W under load (based on [the processor's TDP](https://ark.intel.com/content/www/us/en/ark/products/64611/intel-xeon-processor-e5-2450-20m-cache-2-10-ghz-8-00-gt-s-intel-qpi.html); this is just a rough number and power draw is likely larger when the CPU increases its clock speed under load)
-- "i7 CPU": Intel Core i7-960 CPU with 4 physical, 8 logical cores and a base clock of 3.2 Ghz. Power consumption is around 130 W under load (based on [the processor's TDP](https://ark.intel.com/content/www/us/en/ark/products/37151/intel-core-i7-960-processor-8m-cache-3-20-ghz-4-80-gt-s-intel-qpi.html)).
-- "Nexys": [Nexys A7-100T FPGA board](https://store.digilentinc.com/nexys-a7-fpga-trainer-board-recommended-for-ece-curriculum/) with a Xilinx XC7A100T-1CSG324C FPGA of the Artix-7 family, with 110 MB of user-accessible DDR2 RAM. Main clock is 100 Mhz, power consumption is at most about 2.5 W (corresponding to the maximal power draw via a USB 2.0 port). The communication channel used was the serial one: Virtual serial port via USB 2.0 with a baud rate of 230400 b/s.
-
-
-## Measurements
+## Notes on measurements
 
 Here you can find some measurements of execution times of various algorithms on different platforms. Note:
 
 - Measurements were made with binaries built in Release mode.
+- Measurements were always taken from the second or third run of the application to disregard initialization time. 
 - Figures are rounded to the nearest integer.
 - Speed and power advantage means the execution time and power consumption advantage of the Hastlayer-accelerated FPGA implementation. E.g. a 100% speed advantage means that the Hastlayer-accelerated implementation took half the time to finish than the original CPU one.
 - Degree of parallelism indicates the level of parallelization in the algorithm (typically the number of concurrent `Task`s). For details on these check out the source of the respective algorithm (these are all classes in the Hastlayer SDK's solution). Note that the degree of parallelism, as indicated, differs between platforms. On every platform the highest possible parallelism was used (so CPUs were under full load and thus peak power consumption can be reasonably assumed).
 - For CPU execution always the lowest achieved number is used to disregard noise. So this is an optimistic approach and in real life the CPU executions will most possibly be slower.
-- Power consumption is an approximation based on hardware details above. For PCs it only contains the power consumption of the CPU(s). For FPGA measurements the "total" time is used (though presumably when just communication is running the power consumption is much lower than when computations are being executed).
+- Power consumption is an approximation based on hardware details. For PCs it only contains the power consumption of the CPU(s). For FPGA measurements the "total" time is used (though presumably when just communication is running the power consumption is much lower than when computations are being executed).
 - FPGA resource utilization figures are based on the "main" resource's utilization with all other resource types assumed to be below 100%. For Xilinx FPGAs the main resource type is LUT, for Intel (Altera) ones ALM.
 - For FPGA measurements "total" means the total execution time, including the communication latency of the FPGA; since this varies because of the host PC's load the lowest achieved number is used. "Net" means just the execution of the algorithm itself on the FPGA, not including the time it took to send data to and receive from the device; FPGA execution time is deterministic and doesn't vary significantly. With faster communication channels "total" can be closer to "net". If the input and output data is small then the two measurements will practically be the same.
 
-### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+## Vitis
 
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280/250/200/50) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) instance.
 
-| Algorithm             | Speed advantage | Power advantage |   Parallelism  |   CPU  | CPU power | FPGA utilization | Net FPGA | Total FPGA | FPGA power |
-|:----------------------|:---------------:|:---------------:|:--------------:|:------:|:---------:|:----------------:|:--------:|:----------:|:----------:|
-| ImageContrastModifier |       568%      |       ???%      |        25      | 568 ms |   ?? Ws   |        ??%       |   28 ms  |    85 ms   |   ?? Ws    |
-| ImageContrastModifier |       620%      |       ???%      | 150<sup>1</sup>| 568 ms |   ?? Ws   |        ??%       |   24 ms  |    79 ms   |   ?? Ws    |
+### Details
+
+- FPGA: The following [Xilinx Vitis Unified Software Platform](https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html) cards were used:
+  - [Alveo U280 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u280.html), PCI Express® Gen3 x16, 225 W Maximum Total Power
+  - [Alveo U250 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u250.html), PCI Express® Gen3 x16, 225 W Maximum Total Power
+  - [Alveo U200 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u200.html), PCI Express® Gen3 x16, 225 W Maximum Total Power
+  - [Alveo U50 Data Center Accelerator Card](https://www.xilinx.com/products/boards-and-kits/alveo/u50.html), PCI Express® Gen3 x16, 75 W Maximum Total Power
+- Host: A [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance with 16 x Intel Xeon E5-2640 v3 CPUs with 8 physical, 16 logical cores each, with a base clock of 2.6 GHz. Power consumption is around 90 W under load (based on the processor's TDP, [see here](https://ark.intel.com/content/www/us/en/ark/products/83359/intel-xeon-processor-e5-2640-v3-20m-cache-2-60-ghz.html); the power draw is likely larger when the CPU increases its clock speed under load).
+- Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](Attachments/BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).
+
+### Measurements
+
+| Device     | Algorithm                         | Speed advantage | Power advantage | Parallelism | CPU       | CPU power | FPGA utilization | Net FPGA | Total FPGA | FPGA power | FPGA on-chip power |
+|------------|-----------------------------------|-----------------|-----------------|-------------|-----------|-----------|------------------|----------|------------|------------|--------------------|
+| Alveo U280 | ImageContrastModifier<sup>1</sup> | 1591%           | 6505%           | 150         | 541 ms    | 49 Ws     | 21.44%           | 12 ms    | 32 ms      | 0.74 Ws    | 23.04 W            |
+| Alveo U280 | ImageContrastModifier<sup>3</sup> | 3414%           | 13629%          | 150         | 17359 ms  | 1562 Ws   | 21.44%           | 459 ms   | 494 ms     | 11.38 Ws   | 23.04 W            |
+| Alveo U280 | ParallelAlgorithm                 | 226%            | 1858%           | 300         | 362 ms    | 33 Ws     | 10.86%           | 102 ms   | 111 ms     | 1.66 Ws    | 14.99 W            |
+| Alveo U280 | MonteCarloPiEstimator             | 387%            | 2397%           | 230         | 185 ms    | 17 Ws     | 13.63%           | 16 ms    | 38 ms      | 0.67 Ws    | 17.55 W            |
+| Alveo U250 | ImageContrastModifier<sup>1</sup> | 1503%           | 5621%           | 150         | 529 ms    | 48 Ws     | 18.29%           | 13 ms    | 33 ms      | 0.83 Ws    | 25.22 W            |
+| Alveo U250 | ImageContrastModifier<sup>2</sup> | 3268%           | 11921%          | 150         | 193158 ms | 17384 Ws  | 18.29%           | 5535 ms  | 5735 ms    | 144.61 Ws  | 25.22 W            |
+| Alveo U250 | ParallelAlgorithm                 | 357%            | 2437%           | 300         | 498 ms    | 45 Ws     | 10.30%           | 101 ms   | 109 ms     | 1.77 Ws    | 16.21 W            |
+| Alveo U250 | MonteCarloPiEstimator             | 369%            | 2022%           | 230         | 197 ms    | 18 Ws     | 12.39%           | 21 ms    | 42 ms      | 0.84 Ws    | 19.89 W            |
+| Alveo U200 | ImageContrastModifier<sup>1</sup> | 735%            | 3047%           | 150         | 543 ms    | 49 Ws     | 27.23%           | 12 ms    | 65 ms      | 1.55 Ws    | 23.89 W            |
+| Alveo U200 | ImageContrastModifier<sup>2</sup> | 3448%           | 13266%          | 150         | 198172 ms | 17835 Ws  | 27.23%           | 5340 ms  | 5586 ms    | 133.44 Ws  | 23.89 W            |
+| Alveo U200 | ParallelAlgorithm                 | 171%            | 1464%           | 300         | 379 ms    | 34 Ws     | 15.56%           | 110 ms   | 140 ms     | 2.18 Ws    | 15.58 W            |
+| Alveo U200 | MonteCarloPiEstimator             | 203%            | 1333%           | 230         | 203 ms    | 18 Ws     | 18.57%           | 17 ms    | 67 ms      | 1.28 Ws    | 19.04 W            |
+| Alveo U50  | ImageContrastModifier<sup>1</sup> | 1324%           | 6359%           | 150         | 470 ms    | 42 Ws     | 32.09%           | 12 ms    | 33 ms      | 0.65 Ws    | 19.85 W            |
+| Alveo U50  | ImageContrastModifier<sup>3</sup> | 3462%           | 16052%          | 150         | 17167 ms  | 1545 Ws   | 32.09%           | 450 ms   | 482 ms     | 9.57 Ws    | 19.85 W            |
+| Alveo U50  | ParallelAlgorithm                 | 258%            | 2653%           | 300         | 379 ms    | 34 Ws     | 16.22%           | 104 ms   | 106 ms     | 1.24 Ws    | 11.69 W            |
+| Alveo U50  | MonteCarloPiEstimator             | 348%            | 2693%           | 230         | 197 ms    | 18 Ws     | 20.37%           | 18 ms    | 44 ms      | 0.63 Ws    | 14.43 W            |
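As an editorial aside, the advantage and energy columns in this table can be reproduced from the definitions in the notes above. A quick sketch in Python, using the Alveo U280 MonteCarloPiEstimator row as the example and assuming (per the details) a ~90 W host CPU under load and the on-chip power column for the FPGA:

```python
# Sanity-check of the "advantage" definition above: a 100% advantage means
# the accelerated run took half the time (or energy) of the CPU run.

def advantage_percent(baseline, accelerated):
    return round((baseline / accelerated - 1) * 100)

cpu_ms, total_fpga_ms = 185, 38            # Alveo U280 MonteCarloPiEstimator row
cpu_watts, fpga_on_chip_watts = 90, 17.55  # assumed host TDP; table's on-chip power

cpu_ws = cpu_ms / 1000 * cpu_watts                    # ≈ 17 Ws CPU energy
fpga_ws = total_fpga_ms / 1000 * fpga_on_chip_watts   # ≈ 0.67 Ws FPGA energy

print(advantage_percent(cpu_ms, total_fpga_ms))  # → 387 (% speed advantage)
print(advantage_percent(cpu_ws, fpga_ws))        # → 2397 (% power advantage)
```

Both results match the table row, which suggests the published figures were derived exactly this way.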

With this binary I got the same numbers as you. The cycle count roughly matches (4393400 here vs. 4393419 in the benchmark version), but the frequency differs a lot (293 MHz here, only 241 MHz there), so the resulting ms differs too. I think the reason is that the Wigner cluster1 where the build happened runs version 2019.2. I guess they optimized their compiler for the 2020.1 platforms, so it can run at a higher frequency. It didn't even occur to me that there was a difference at all, let alone such a big one!
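The run-time gap follows directly from net FPGA time = cycle count / clock frequency; a quick check (editorial sketch, not part of the original thread) with the numbers quoted in this comment:

```python
# Net FPGA run time from cycle count and clock frequency.
def net_ms(cycles, mhz):
    return cycles / (mhz * 1e6) * 1000

print(round(net_ms(4393400, 293), 2))  # → 14.99 ms with the 2020.1 build
print(round(net_ms(4393419, 241), 2))  # → 18.23 ms with the 2019.2 build
print(round((293 / 241 - 1) * 100))    # → 22 (% clock frequency improvement)
```

So a near-identical cycle count at a ~20% higher clock fully accounts for the shorter run time.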

Unsurprisingly the power consumption is a bit higher, 16.784 W on-chip power, but because of the smaller total time the overall consumption improves. Interestingly, the FPGA utilization is a tiny bit worse though: 20.35% instead of 20.37%. You'd think there would be a lot to gain here, but apparently not.

I'll amend the docs:

  • The kernels were built on the initial 2019.2 version of the Vitis Unified Software Platform. We have seen in one case 20% improvement in frequency (leading to shorter run times and lower total power consumption) by compiling with the newer 2020.1 version.
DAud-IcI

comment created time in 4 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here you can find some measurements of execution times of various algorithms on 
 ### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).

I see, thanks. So this upload can't be avoided then, and in OpenCL everyone everywhere does it this way, re-sending the kernel before every call?

DAud-IcI

comment created time in 4 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis


As a test I just ran this on a U50 x16 machine, and I got 15 ms (14.9946) net, 33 ms (minimum) total in Debug mode. The FPGA is deterministic and always produces exactly the same result, so something's off here. It's fine that it won't match down to the µs, since there are small differences in the hardware, but you got a 20% longer run time, which that can't explain (or Xilinx's manufacturing is really shoddy).

Please look into what the cause might be. Here's the whole folder: the software (pre-configured, so just dotnet Hast.Samples.Consumer.dll), with the HWF folder in it too. Hastlayer.zip

On the CPU it's about 300 ms by the way, but that's not too surprising given Debug mode.

DAud-IcI

comment created time in 4 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis


In OpenClCommunicationService.Execute, _binaryOpenCl.CreateBinaryKernel(kernelBinary, KernelName); creates the native object here. I'm not 100% sure whether it's already sent to the device at that point, or only a bit further down when the program is launched (_binaryOpenCl.LaunchKernel(deviceIndex, KernelName, new[] { fpgaBuffer });) here. It's not clear-cut, because OpenCL handles the kernel upload.

DAud-IcI

comment created time in 5 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis


Hmm, where exactly is this kernel upload?

DAud-IcI

comment created time in 5 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis


It depends on what we mean by PCIe setup. This wakes the card up if necessary.

Uploading the kernel is a different matter though. The communication service creates the kernel on every call, so running it multiple times within the app wouldn't gain anything. That's because launching the program (when you enqueue it into the command queue) requires sending the kernel as well. This is a normal OpenCL pattern. In principle the Xilinx firmware ensures that if the currently loaded kernel arrives again, the card isn't reprogrammed. But even if there is a difference because of this, it's imperceptibly small. When I ran the ImageContrastModifier with the large image ten times in a row, I saw a ±0.02% variation, but randomly, without any trend.
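To illustrate the pattern described here, a toy model (hypothetical code, not the actual Hastlayer or Xilinx firmware): the host re-sends the kernel binary on every call, while the "firmware" skips reprogramming when that binary is already loaded, so repeated in-app runs pay almost nothing extra:

```python
# Toy model of the reprogram-skip behavior described above.
# FakeCard and its methods are illustrative names, not real XRT APIs.
import hashlib

class FakeCard:
    def __init__(self):
        self.loaded = None      # hash of the currently programmed binary
        self.reprograms = 0     # how many times the card was actually programmed

    def load_kernel(self, binary: bytes):
        digest = hashlib.sha256(binary).hexdigest()
        if digest != self.loaded:   # only reprogram when a new binary arrives
            self.loaded = digest
            self.reprograms += 1

card = FakeCard()
kernel = b"xclbin contents"
for _ in range(10):           # ten calls, like the ImageContrastModifier test above
    card.load_kernel(kernel)  # the kernel is re-sent on every call...
print(card.reprograms)        # → 1: ...but the card is programmed only once
```

This matches the observation that back-to-back runs show only noise-level (±0.02%) variation.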

DAud-IcI

comment created time in 5 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 async Task InvocationHandler()                         var memory = (SimpleMemory)invocation.Arguments.SingleOrDefault(argument => argument is SimpleMemory);
                         if (memory != null)
                         {
-                            foreach (var checker in memoryResourceCheckers)
+                            if (memoryResourceCheckers.Check(memory, hardwareRepresentation) is { } problem)
                             {
-                                checker.EnsureResourceAvailable(memory, hardwareRepresentation);
+                                var exception = new InvalidOperationException(
+                                    $"The input is too large to fit into the device's memory. The input is " +
+                                    $"{problem.MemoryByteCount} bytes, the available memory is " +
+                                    $"{problem.AvailableByteCount} bytes. (reported by {problem.GetType().FullName}) " +

Well, a sentence fragment is still a sentence, so we still capitalize it and put punctuation at the end :). But it's OK this way too.

DAud-IcI

comment created time in 5 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis


Well, the part inside the parentheses isn't a complete sentence, so it's odd to start it with a capital letter; on the other hand, it's a separate thought that has nothing to do with the previous sentence, so it wouldn't be right to pull it in before the period either. In that case I'd rather go with

$"{problem.AvailableByteCount} bytes. (Reported by {problem.GetType().FullName}.) "
DAud-IcI

comment created time in 5 days
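The diff under review replaces a loop over individual checkers with a single `Check` call whose non-null result is bound via the `is { } problem` property pattern. A minimal, self-contained sketch of that pattern follows; the type shapes (`SimpleMemory`, `MemoryResourceProblem`, `FixedLimitChecker`) are hypothetical stand-ins, not the actual Hastlayer API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the memory type referenced in the diff.
public sealed class SimpleMemory
{
    public SimpleMemory(long byteCount) => ByteCount = byteCount;
    public long ByteCount { get; }
}

// A checker reports a problem instead of throwing, so the caller can build
// one aggregate error message (as the diff's InvalidOperationException does).
public sealed record MemoryResourceProblem(long MemoryByteCount, long AvailableByteCount);

public interface IMemoryResourceChecker
{
    // Returns null when the memory fits, or a problem report when it doesn't.
    MemoryResourceProblem? Check(SimpleMemory memory);
}

// Illustrative checker that enforces a fixed byte limit.
public sealed class FixedLimitChecker : IMemoryResourceChecker
{
    private readonly long _availableByteCount;
    public FixedLimitChecker(long availableByteCount) => _availableByteCount = availableByteCount;

    public MemoryResourceProblem? Check(SimpleMemory memory) =>
        memory.ByteCount > _availableByteCount
            ? new MemoryResourceProblem(memory.ByteCount, _availableByteCount)
            : null;
}

public static class MemoryResourceCheckerExtensions
{
    // The first reported problem wins; null means every checker passed.
    public static MemoryResourceProblem? Check(
        this IEnumerable<IMemoryResourceChecker> checkers, SimpleMemory memory) =>
        checkers.Select(checker => checker.Check(memory)).FirstOrDefault(problem => problem != null);
}
```

With this shape, the caller can write `if (checkers.Check(memory) is { } problem) { ... }` (C# 8+): the property pattern both null-checks the result and binds it to `problem` in one expression, which is what makes the single-call form in the diff tidier than the original `foreach`.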

push eventLombiq/Hastlayer-SDK

Dávid El-Saig

commit sha b07abe948e144bf9dc37d8bb5ec90c16a5143a2f

Alter text.

view details

push time in 5 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 Here you can find some measurements of execution times of various algorithms on 
 ### Vitis
 
-Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2019.2" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity.
+Comparing the performance of a Vitis platform FPGA (Xilinx Alveo U280) to the host PC's performance on a [Nimbix](https://www.nimbix.net/alveo) "Xilinx Vitis Unified Software Platform 2020.1" instance. Only a single CPU is assumed to be running under 100% load for the power usage figures for the sake of simplicity. The table has a matching [Excel sheet](BenchmarksVitis.xlsx) that was converted using [this VS Code extension](https://marketplace.visualstudio.com/items?itemName=csholmq.excel-to-markdown-table).

By the way, why is this a solution for the PCIe setup? For JIT it's OK, but here the whole app runs on three separate occasions, so whatever it loads is lost in between. It should be run multiple times within the app.

DAud-IcI

comment created time in 5 days

Pull request review comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

 async Task InvocationHandler()
                         var memory = (SimpleMemory)invocation.Arguments.SingleOrDefault(argument => argument is SimpleMemory);
                         if (memory != null)
                         {
-                            foreach (var checker in memoryResourceCheckers)
+                            if (memoryResourceCheckers.Check(memory, hardwareRepresentation) is { } problem)
                             {
-                                checker.EnsureResourceAvailable(memory, hardwareRepresentation);
+                                var exception = new InvalidOperationException(
+                                    $"The input is too large to fit into the device's memory. The input is " +
+                                    $"{problem.MemoryByteCount} bytes, the available memory is " +
+                                    $"{problem.AvailableByteCount} bytes. (reported by {problem.GetType().FullName}) " +

This starting with a lowercase letter inside the parentheses after a sentence closed with a period is a recurring quirk :D.

DAud-IcI

comment created time in 5 days

push eventLombiq/Hastlayer-SDK

Zoltán Lehóczky

commit sha 62bbb70772c7651817893cb870be21dec8f6329e

Code styling, typo

view details

push time in 5 days

push eventLombiq/Hastlayer-SDK

Zoltán Lehóczky

commit sha 428846bbe0518fe48c1e278b1da4b8676639e1a7

Removing <sup>s from the Excel spreadsheet

view details

push time in 5 days

Pull request comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

It's good now, thanks.

DAud-IcI

comment created time in 5 days

Pull request comment Lombiq/Hastlayer-SDK

HAST-159: New benchmarks for Vitis

Eh, sorry, how about now?

DAud-IcI

comment created time in 5 days
