cfloat<nbits, es>

Arbitrary, fixed-size, classic floating-point with arithmetic exceptions

cfloat<nbits, es> is an arbitrary, fixed-size, floating-point type.

The type definition for arbitrary, fixed-size floating-points in Universal is:

template<size_t nbits, size_t es, typename bt = uint8_t,
	bool hasSubnormals = false, bool hasSupernormals = false, bool isSaturating = false>
class cfloat;

The cfloat floating-point type is parameterized in the number of bits in the encoding, the number of exponent bits to use to represent scale, conditional support for subnormals and supernormals, the type of arithmetic (Clipping or Saturating), and the block type to use in the representation.

The type will automatically allocate the minimum number of blocks to represent the cfloat. Effectively, the size of the block type will define the memory alignment of cfloat values in arrays and vectors.
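As a rough illustration of the allocation rule above, the minimum block count is the encoding width divided by the block width, rounded up. The helper `min_blocks` below is hypothetical and not part of Universal; it only sketches the arithmetic and assumes nothing about how the library stores its blocks.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of Universal: the smallest number of blocks
// of type bt that can hold an nbits-wide encoding (ceiling division).
template<typename bt>
constexpr std::size_t min_blocks(std::size_t nbits) {
    return (nbits + sizeof(bt) * CHAR_BIT - 1) / (sizeof(bt) * CHAR_BIT);
}

static_assert(min_blocks<std::uint8_t>(8) == 1, "8 bits fit in one byte");
static_assert(min_blocks<std::uint8_t>(9) == 2, "9 bits need two bytes");
static_assert(min_blocks<std::uint32_t>(80) == 3, "80 bits need three 32-bit blocks");
```

Under this arithmetic, a 9-bit encoding occupies two bytes with a uint8_t block type, or a single 16-bit block with uint16_t, which is where the block type's influence on array layout comes from.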

Subnormals are encodings that have no implicit hidden bit; they are encoded with an exponent field of 0. In IEEE-754 floating-point, special values such as infinity and Not-a-Number are encoded with an exponent field that has all bits set. This design wastes many possible value encodings, which is particularly egregious for small exponent fields. For example, when es = 2, 25% of all encodings have an all-ones exponent field and so cannot represent ordinary values.
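The 25% figure can be checked by counting bit patterns: fixing the es exponent bits to all ones leaves the sign and fraction bits free, so 2^(nbits - es) of the 2^nbits patterns fall in that range. The function names below are illustrative only and assume nbits < 64.

```cpp
#include <cstdint>

// Total number of bit patterns in an nbits-wide encoding (valid for nbits < 64).
constexpr std::uint64_t total_encodings(unsigned nbits) {
    return std::uint64_t(1) << nbits;
}

// Patterns whose es-bit exponent field is all ones: the exponent is fixed,
// while the sign bit and the fraction bits remain free.
constexpr std::uint64_t all_ones_exponent_encodings(unsigned nbits, unsigned es) {
    return std::uint64_t(1) << (nbits - es);
}

// Share of the encoding space with an all-ones exponent field, in percent.
constexpr double all_ones_exponent_percent(unsigned nbits, unsigned es) {
    return 100.0 * static_cast<double>(all_ones_exponent_encodings(nbits, es))
                 / static_cast<double>(total_encodings(nbits));
}

static_assert(all_ones_exponent_encodings(8, 2) == 64, "64 of 256 patterns for nbits = 8, es = 2");
static_assert(all_ones_exponent_encodings(16, 5) == 2048, "2048 of 65536 patterns for nbits = 16, es = 5");
```

The share is 1/2^es regardless of nbits, so it shrinks quickly as the exponent field grows: 25% for es = 2, but only about 3% for es = 5.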

To create useful encodings for small configurations, cfloat offers supernormals, modeled symmetrically to subnormals and encoded when all exponent bits are set. The cfloat type uses only two encodings for ±infinity, and two encodings for NaN (signaling and quiet).
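Taking the counts in this paragraph at face value (two infinities and two NaNs), the number of encodings that supernormals recover for values can be sketched as follows. The function names are hypothetical, and the sketch assumes nbits - es is at least 2 and less than 64.

```cpp
#include <cstdint>

// Encodings whose exponent field is all ones (sign and fraction bits free).
constexpr std::uint64_t top_exponent_encodings(unsigned nbits, unsigned es) {
    return std::uint64_t(1) << (nbits - es);
}

// Assumption taken from the text above: +infinity, -infinity, a signaling NaN,
// and a quiet NaN are the only reserved patterns.
constexpr std::uint64_t reserved_special_encodings() {
    return 4;
}

// Encodings in the all-ones-exponent range that remain available for values
// when supernormals are enabled.
constexpr std::uint64_t supernormal_value_encodings(unsigned nbits, unsigned es) {
    return top_exponent_encodings(nbits, es) - reserved_special_encodings();
}

static_assert(supernormal_value_encodings(8, 2) == 60, "64 patterns minus 4 specials");
```

For an 8-bit format with es = 2, that is 60 of the 64 all-ones-exponent patterns put back to work as values.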

By default, neither subnormals nor supernormals are enabled, reflecting typical CPU, GPU, and FPGA DSP-block hardware configurations. CPUs and GPUs tend to use software emulation for subnormals and lack supernormals. Also by default, cfloat offers clipping arithmetic to ±infinity.

The cfloat type can be compiled with or without exceptions using a compile guard: CFLOAT_THROW_ARITHMETIC_EXCEPTION, to be set before including the type in your module, like so:

#define CFLOAT_THROW_ARITHMETIC_EXCEPTION 1
#include <universal/number/cfloat/cfloat.hpp>

By default, exceptions are not enabled. They are defined, but not thrown, so client code will work in either configuration.
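The "defined, but not thrown" arrangement can be pictured with a small stand-alone sketch. Everything named `demo_*` below is hypothetical and is not Universal's implementation; only the guard macro name comes from this page. The point is that the exception type exists in both configurations, so a client's catch clause always compiles, while the throw itself is compiled in only when the guard is nonzero.

```cpp
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for a library exception type. It is always defined,
// so catch clauses compile whether or not the guard is set.
struct demo_divide_by_zero : std::runtime_error {
    demo_divide_by_zero() : std::runtime_error("divide by zero") {}
};

// Hypothetical operation: throws only when the guard is set to a nonzero value.
inline double demo_divide(double a, double b) {
    if (b == 0.0) {
#if CFLOAT_THROW_ARITHMETIC_EXCEPTION
        throw demo_divide_by_zero();
#else
        // Non-throwing fallback; the sign of the result is ignored for brevity.
        return std::numeric_limits<double>::infinity();
#endif
    }
    return a / b;
}

// Client code that is valid in either configuration.
inline bool demo_client_saw_exception(double a, double b) {
    try {
        (void)demo_divide(a, b);
    } catch (const demo_divide_by_zero&) {
        return true;
    }
    return false;
}
```

With the guard unset, `demo_client_saw_exception(1.0, 0.0)` returns false and the division yields infinity; with the guard defined as 1 before this code is compiled, the same call returns true.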

The cfloat floating-point type offers the full spectrum of floating-point sizes and is effective for transforming fixed-point algorithms when their dynamic range needs to be expanded, for example in beamforming and MIMO.

As small cfloats with subnormals and supernormals enabled offer symmetry and a full set of value encodings, types such as cfloat<8,2> and cfloat<9,3> are now effective to explore memory bandwidth and energy optimizations for high-performance deep learning. Furthermore, a cfloat<16,5> would be equivalent to IEEE half-precision fp16, and cfloat<16,8> is equivalent to Google's bfloat16. The formats are depicted below:
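The field widths of the two 16-bit formats follow directly from nbits and es. The sketch below assumes the IEEE-754 conventions of a single sign bit and an exponent bias of 2^(es-1) - 1; the `Layout` struct and `layout` function are hypothetical names used for illustration, not library API.

```cpp
// Hypothetical description of a sign/exponent/fraction layout.
struct Layout {
    unsigned sign_bits;
    unsigned exponent_bits;
    unsigned fraction_bits;
    int bias;
};

// Assumes one sign bit and an IEEE-754 style bias of 2^(es-1) - 1.
// Requires es >= 1 and nbits > es + 1.
constexpr Layout layout(unsigned nbits, unsigned es) {
    return Layout{1u, es, nbits - 1u - es, static_cast<int>((1u << (es - 1u)) - 1u)};
}

// cfloat<16,5>: same split as IEEE half precision (1 + 5 + 10, bias 15).
static_assert(layout(16, 5).fraction_bits == 10, "fp16 has 10 fraction bits");
static_assert(layout(16, 5).bias == 15, "fp16 bias is 15");

// cfloat<16,8>: same split as bfloat16 (1 + 8 + 7, bias 127).
static_assert(layout(16, 8).fraction_bits == 7, "bfloat16 has 7 fraction bits");
static_assert(layout(16, 8).bias == 127, "bfloat16 bias is 127");
```

The comparison makes the trade-off visible: both formats spend 16 bits, but bfloat16 gives up three fraction bits to gain three exponent bits, trading precision for dynamic range.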

For computational mathematics, configurations such as cfloat<80,11> provide direct access to the extended-precision types that are at play in most CPUs. For numerical applications that require an oracle, a configuration such as cfloat<128,15> would be useful for computations that exhibit large dynamic range, and cfloat<192,8> for computations that need high precision. There is no limit to nbits, only time and space on your computer.

