Performance

Universal is production-worthy software that is currently integrated in HPC codes ranging from distributed-memory and hardware-accelerated linear algebra libraries to next-generation FEM codes and quantum computing simulators. On a commodity CPU, the emulated performance is between 1% and 25% of the peak performance of the native floating-point hardware, as demonstrated by our performance benchmarking regression suite:

   comparative floating-point special value processing performance
-------------------------------------------------------------------------
float                    zeros              0.0005941sec ->   1 Gops/sec
float                    ones               0.0006163sec ->   1 Gops/sec
float                    subnormals         0.0335508sec ->  31 Mops/sec
float                    Inf                0.0005913sec ->   1 Gops/sec
float                    NaN                0.0005906sec ->   1 Gops/sec
-------------------------------------------------------------------------
double                   zeros              0.0006350sec ->   1 Gops/sec
double                   ones               0.0006246sec ->   1 Gops/sec
double                   subnormals         0.0336820sec ->  31 Mops/sec
double                   Inf                0.0006805sec ->   1 Gops/sec
double                   NaN                0.0012707sec -> 825 Mops/sec
-------------------------------------------------------------------------
long double              zeros              0.0021333sec -> 491 Mops/sec
long double              ones               0.0021461sec -> 488 Mops/sec
long double              subnormals         0.3525970sec ->   2 Mops/sec
long double              Inf                0.1954710sec ->   5 Mops/sec
long double              NaN                0.2058690sec ->   5 Mops/sec
-------------------------------------------------------------------------
cfloat<  8, 2>           zeros              0.0109907sec ->  95 Mops/sec
cfloat<  8, 2>           ones               0.0648926sec ->  16 Mops/sec
cfloat<  8, 2>           subnormals         0.0716940sec ->  14 Mops/sec
cfloat<  8, 2>           Inf                0.0103379sec -> 101 Mops/sec
cfloat<  8, 2>           NaN                0.0094938sec -> 110 Mops/sec
-------------------------------------------------------------------------
cfloat< 16, 5>           zeros              0.0169976sec ->  61 Mops/sec
cfloat< 16, 5>           ones               0.0906621sec ->  11 Mops/sec
cfloat< 16, 5>           subnormals         0.1039400sec ->  10 Mops/sec
cfloat< 16, 5>           Inf                0.0143550sec ->  73 Mops/sec
cfloat< 16, 5>           NaN                0.0120827sec ->  86 Mops/sec
-------------------------------------------------------------------------
cfloat< 32, 8>           zeros              0.0103935sec -> 100 Mops/sec
cfloat< 32, 8>           ones               0.1565900sec ->   6 Mops/sec
cfloat< 32, 8>           subnormals         0.1856190sec ->   5 Mops/sec
cfloat< 32, 8>           Inf                0.0080376sec -> 130 Mops/sec
cfloat< 32, 8>           NaN                0.0058051sec -> 180 Mops/sec
-------------------------------------------------------------------------
posit<  8,0>             zeros              0.1618230sec ->   6 Mops/sec
posit<  8,0>             ones               0.5122780sec ->   2 Mops/sec
posit<  8,0>             subnormals         0.4846060sec ->   2 Mops/sec
posit<  8,0>             Inf                0.4255450sec ->   2 Mops/sec
posit<  8,0>             NaN                0.1622420sec ->   6 Mops/sec
-------------------------------------------------------------------------
posit< 16,1>             zeros              0.1889870sec ->   5 Mops/sec
posit< 16,1>             ones               0.2096740sec ->   5 Mops/sec
posit< 16,1>             subnormals         0.2207620sec ->   4 Mops/sec
posit< 16,1>             Inf                0.2236020sec ->   4 Mops/sec
posit< 16,1>             NaN                0.1900360sec ->   5 Mops/sec
-------------------------------------------------------------------------
posit< 32,2>             zeros              0.2458460sec ->   4 Mops/sec
posit< 32,2>             ones               0.2558330sec ->   4 Mops/sec
posit< 32,2>             subnormals         0.2751780sec ->   3 Mops/sec
posit< 32,2>             Inf                0.2788860sec ->   3 Mops/sec
posit< 32,2>             NaN                0.2574310sec ->   4 Mops/sec
-------------------------------------------------------------------------
posit< 64,3>             zeros              0.4248180sec ->   2 Mops/sec
posit< 64,3>             ones               2.9339300sec -> 357 Kops/sec
posit< 64,3>             subnormals         2.0671000sec -> 507 Kops/sec
posit< 64,3>             Inf                2.2326000sec -> 469 Kops/sec
posit< 64,3>             NaN                0.3717860sec ->   2 Mops/sec
-------------------------------------------------------------------------
posit<128,4>             zeros              0.5575080sec ->   1 Mops/sec
posit<128,4>             ones               5.5065500sec -> 190 Kops/sec
posit<128,4>             subnormals         3.9879900sec -> 262 Kops/sec
posit<128,4>             Inf                4.2924700sec -> 244 Kops/sec
posit<128,4>             NaN                0.5599230sec ->   1 Mops/sec
-------------------------------------------------------------------------
posit<256,5>             zeros              0.9279230sec ->   1 Mops/sec
posit<256,5>             ones               10.188500sec -> 102 Kops/sec
posit<256,5>             subnormals         7.3882600sec -> 141 Kops/sec
posit<256,5>             Inf                7.9310700sec -> 132 Kops/sec
posit<256,5>             NaN                0.9320970sec ->   1 Mops/sec
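
The throughput column in the table above is the operation count divided by the elapsed time, scaled to Kops, Mops, or Gops. As a rough illustration of how such a row could be produced for a native type, here is a minimal sketch. It is not the code of the regression suite: the helper names `format_throughput` and `measure_add_throughput` are hypothetical, and the choice of addition as the timed operation is an assumption, since the table does not state which operation is measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

// Scale a raw rate to the Kops/Mops/Gops units used in the table.
// Hypothetical helper for illustration, not part of Universal.
std::string format_throughput(double ops_per_sec) {
    const char* unit = "ops/sec";
    double scaled = ops_per_sec;
    if (ops_per_sec >= 1e9)      { scaled = ops_per_sec / 1e9; unit = "Gops/sec"; }
    else if (ops_per_sec >= 1e6) { scaled = ops_per_sec / 1e6; unit = "Mops/sec"; }
    else if (ops_per_sec >= 1e3) { scaled = ops_per_sec / 1e3; unit = "Kops/sec"; }
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%3.0f %s", scaled, unit);
    return std::string(buf);
}

// Time nr_ops dependent additions of `value` and return operations per second.
// Assumption: addition stands in for whatever operation the real suite times.
template <typename Real>
double measure_add_throughput(Real value, std::size_t nr_ops) {
    assert(nr_ops > 0 && "need at least one operation");
    Real acc = Real(0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < nr_ops; ++i) {
        acc = acc + value;  // each add depends on the previous result
    }
    const auto stop = std::chrono::steady_clock::now();
    volatile Real sink = acc;  // store the result so the loop is not discarded
    (void)sink;
    const std::chrono::duration<double> elapsed = stop - start;
    if (elapsed.count() <= 0.0) return 0.0;  // timer resolution too coarse
    return static_cast<double>(nr_ops) / elapsed.count();
}
```

Calling `measure_add_throughput(std::numeric_limits<float>::denorm_min(), n)` (with `<limits>` included) and passing the result to `format_throughput` would give a figure comparable in form to the `subnormals` rows. The absolute numbers depend on the compiler, optimization flags, and CPU.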

However, the ultimate target for mixed-precision algorithms designed and validated with Universal is FPGA and custom ASIC hardware. Our early prototype hardware designs target ~5 TOPS on commodity FPGAs and >100 TOPS on custom ASIC designs.
