Performance
Universal is production-worthy software that is currently integrated in HPC codes ranging from distributed-memory and hardware-accelerated linear algebra libraries to next-generation FEM codes and quantum computing simulators. On a commodity CPU, the emulated number types reach between 1% and 25% of the peak performance of the native floating-point hardware, as demonstrated by our performance benchmarking regression suite:
comparative floating-point special value processing performance
-------------------------------------------------------------------------
float zeros 0.0005941sec -> 1 Gops/sec
float ones 0.0006163sec -> 1 Gops/sec
float subnormals 0.0335508sec -> 31 Mops/sec
float Inf 0.0005913sec -> 1 Gops/sec
float NaN 0.0005906sec -> 1 Gops/sec
-------------------------------------------------------------------------
double zeros 0.0006350sec -> 1 Gops/sec
double ones 0.0006246sec -> 1 Gops/sec
double subnormals 0.0336820sec -> 31 Mops/sec
double Inf 0.0006805sec -> 1 Gops/sec
double NaN 0.0012707sec -> 825 Mops/sec
-------------------------------------------------------------------------
long double zeros 0.0021333sec -> 491 Mops/sec
long double ones 0.0021461sec -> 488 Mops/sec
long double subnormals 0.3525970sec -> 2 Mops/sec
long double Inf 0.1954710sec -> 5 Mops/sec
long double NaN 0.2058690sec -> 5 Mops/sec
-------------------------------------------------------------------------
cfloat< 8, 2> zeros 0.0109907sec -> 95 Mops/sec
cfloat< 8, 2> ones 0.0648926sec -> 16 Mops/sec
cfloat< 8, 2> subnormals 0.0716940sec -> 14 Mops/sec
cfloat< 8, 2> Inf 0.0103379sec -> 101 Mops/sec
cfloat< 8, 2> NaN 0.0094938sec -> 110 Mops/sec
-------------------------------------------------------------------------
cfloat< 16, 5> zeros 0.0169976sec -> 61 Mops/sec
cfloat< 16, 5> ones 0.0906621sec -> 11 Mops/sec
cfloat< 16, 5> subnormals 0.1039400sec -> 10 Mops/sec
cfloat< 16, 5> Inf 0.0143550sec -> 73 Mops/sec
cfloat< 16, 5> NaN 0.0120827sec -> 86 Mops/sec
-------------------------------------------------------------------------
cfloat< 32, 8> zeros 0.0103935sec -> 100 Mops/sec
cfloat< 32, 8> ones 0.1565900sec -> 6 Mops/sec
cfloat< 32, 8> subnormals 0.1856190sec -> 5 Mops/sec
cfloat< 32, 8> Inf 0.0080376sec -> 130 Mops/sec
cfloat< 32, 8> NaN 0.0058051sec -> 180 Mops/sec
-------------------------------------------------------------------------
posit< 8,0> zeros 0.1618230sec -> 6 Mops/sec
posit< 8,0> ones 0.5122780sec -> 2 Mops/sec
posit< 8,0> subnormals 0.4846060sec -> 2 Mops/sec
posit< 8,0> Inf 0.4255450sec -> 2 Mops/sec
posit< 8,0> NaN 0.1622420sec -> 6 Mops/sec
-------------------------------------------------------------------------
posit< 16,1> zeros 0.1889870sec -> 5 Mops/sec
posit< 16,1> ones 0.2096740sec -> 5 Mops/sec
posit< 16,1> subnormals 0.2207620sec -> 4 Mops/sec
posit< 16,1> Inf 0.2236020sec -> 4 Mops/sec
posit< 16,1> NaN 0.1900360sec -> 5 Mops/sec
-------------------------------------------------------------------------
posit< 32,2> zeros 0.2458460sec -> 4 Mops/sec
posit< 32,2> ones 0.2558330sec -> 4 Mops/sec
posit< 32,2> subnormals 0.2751780sec -> 3 Mops/sec
posit< 32,2> Inf 0.2788860sec -> 3 Mops/sec
posit< 32,2> NaN 0.2574310sec -> 4 Mops/sec
-------------------------------------------------------------------------
posit< 64,3> zeros 0.4248180sec -> 2 Mops/sec
posit< 64,3> ones 2.9339300sec -> 357 Kops/sec
posit< 64,3> subnormals 2.0671000sec -> 507 Kops/sec
posit< 64,3> Inf 2.2326000sec -> 469 Kops/sec
posit< 64,3> NaN 0.3717860sec -> 2 Mops/sec
-------------------------------------------------------------------------
posit<128,4> zeros 0.5575080sec -> 1 Mops/sec
posit<128,4> ones 5.5065500sec -> 190 Kops/sec
posit<128,4> subnormals 3.9879900sec -> 262 Kops/sec
posit<128,4> Inf 4.2924700sec -> 244 Kops/sec
posit<128,4> NaN 0.5599230sec -> 1 Mops/sec
-------------------------------------------------------------------------
posit<256,5> zeros 0.9279230sec -> 1 Mops/sec
posit<256,5> ones 10.188500sec -> 102 Kops/sec
posit<256,5> subnormals 7.3882600sec -> 141 Kops/sec
posit<256,5> Inf 7.9310700sec -> 132 Kops/sec
posit<256,5> NaN 0.9320970sec -> 1 Mops/sec
However, the ultimate target for mixed-precision algorithms designed and validated with Universal is FPGA and custom ASIC hardware. Our early prototype hardware designs target ~5 TOPS on commodity FPGAs and >100 TOPS on custom ASIC designs.
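To make the relationship between the timings and the throughput column in the benchmark listing above concrete, here is a minimal sketch of how such figures can be derived. It is not Universal's benchmark harness: `format_rate` and `time_additions` are hypothetical helpers written for illustration, and the choice of repeated addition as the workload is an assumption. The reported rates are consistent with roughly 2^20 (1,048,576) operations per test and with truncation to a whole number of Kops, Mops, or Gops, but that is an inference from the numbers, not something the output states.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Universal): turn an operation count and an
// elapsed time into a truncated, scaled rate string such as "31 Mops/sec".
std::string format_rate(std::uint64_t ops, double seconds) {
    assert(seconds > 0.0);  // a zero or negative interval has no meaningful rate
    double rate = static_cast<double>(ops) / seconds;
    const char* units[] = {"ops/sec", "Kops/sec", "Mops/sec", "Gops/sec", "Tops/sec"};
    std::size_t scale = 0;
    // Divide by 1000 until the value fits the largest applicable unit.
    while (rate >= 1000.0 && scale < 4) {
        rate /= 1000.0;
        ++scale;
    }
    // Truncate (do not round) to a whole number of scaled operations.
    return std::to_string(static_cast<std::uint64_t>(rate)) + " " + units[scale];
}

// Hypothetical workload (an assumption for illustration): time `ops` dependent
// additions on any arithmetic type. The result is written to `sink` so the
// compiler cannot discard the loop entirely.
template <typename Real>
double time_additions(std::uint64_t ops, Real a, Real b, Real& sink) {
    auto start = std::chrono::steady_clock::now();
    Real acc = a;
    for (std::uint64_t i = 0; i < ops; ++i) {
        acc = acc + b;
    }
    auto stop = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration<double>(stop - start).count();  // seconds
}
```

Under the 2^20 assumption, `format_rate(1048576, 0.0335508)` yields `31 Mops/sec` and `format_rate(1048576, 2.93393)` yields `357 Kops/sec`, matching the `float subnormals` and `posit< 64,3> ones` rows. Because `time_additions` is a template, the same loop could in principle be instantiated with an emulated number type in place of `float` or `double`, provided that type supports `operator+`.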