API
Functions
- gfloat.decode_float(fi: FormatInfo, i: int) FloatValue[source]
Given
FormatInfoand integer code point, decode to aFloatValue- Parameters:
fi¶ (FormatInfo) – Floating point format descriptor.
i¶ (int) – Integer code point, in the range \(0 \le i < 2^{k}\), where \(k\) =
fi.k
- Returns:
Decoded float value
- Raises:
ValueError – If
iis outside the range of valid code points infi.
- gfloat.round_float(fi: FormatInfo, v: float, rnd: RoundMode = RoundMode.TiesToEven, sat: bool = False) float[source]
Round input to the given
FormatInfo, given rounding mode and saturation flagAn input NaN will convert to a NaN in the target. An input Infinity will convert to the largest float if
sat, otherwise to an Inf, if present, otherwise to a NaN. Negative zero will be returned if the format has negative zero, otherwise zero.- Parameters:
- Returns:
A float which is one of the values in the format.
- Raises:
ValueError – The target format cannot represent the input (e.g. converting a NaN, or an Inf when the target has no NaN or Inf, and
satis false)
- gfloat.encode_float(fi: FormatInfo, v: float) int[source]
Encode input to the given
FormatInfo.Will round toward zero if
vis not in the value set. Will saturate to Inf, NaN, fi.max in order of precedence. Encode -0 to 0 if not fi.has_nzFor other roundings and saturations, call
round_float()first.- Parameters:
fi¶ (FormatInfo) – Describes the target format
v¶ (float) – The value to be encoded.
- Returns:
The integer code point
- gfloat.decode_block(fi: BlockFormatInfo, block: Iterable[int]) Iterable[float][source]
Decode a
blockof integer codepoints in Block FormatfiThe scale is encoded in the first value of
block, with the remaining values encoding the block elements.The size of the iterable is not checked against the format descriptor.
- Parameters:
fi¶ (BlockFormatInfo) – Describes the block format
block¶ (Iterable[int]) – Input block
- Returns:
A sequence of floats representing the encoded values.
- gfloat.encode_block(fi: BlockFormatInfo, scale: float, vals: Iterable[float]) Iterable[int][source]
Encode a
blockof bytes into block Format descibed byfiThe
scaleis explicitly passed, and is converted to 1/(1/scale) before rounding to the target format.It is checked for overflow in the target format, and will raise an exception if it does.
- Parameters:
fi¶ (BlockFormatInfo) – Describes the target block format
scale¶ (float) – Scale to be recorded in the block
vals¶ (Iterable[float]) – Input block
- Returns:
A sequence of ints representing the encoded values.
- Raises:
ValueError – The scale overflows the target scale encoding format.
Classes
- class gfloat.FormatInfo[source]
Class describing a floating-point format, parametrized by width, precision, and special value encoding rules.
- name: str
Short name for the format, e.g. binary32, bfloat16
- k: int
Number of bits in the format
- precision: int
Number of significand bits (including implicit leading bit)
- emax: int
Largest exponent, emax, which shall equal floor(log_2(maxFinite))
- has_nz: bool
Set if format encodes -0 at (sgn=1,exp=0,significand=0). If False, that encoding decodes to a NaN labelled NaN_0
- has_infs: bool
Set if format includes +/- Infinity. If set, the non-nan value with the highest encoding for each sign (s) is replaced by (s)Inf.
- num_high_nans: int
Number of NaNs that are encoded in the highest encodings for each sign
- has_subnormals: bool
Set if format encodes subnormals
- is_signed: bool
Set if the format has a sign bit
- is_twos_complement: bool
Set if the format uses two’s complement encoding for the significand
- property tSignificandBits: int
The number of trailing significand bits, t
- property expBits: int
The number of exponent bits, w
- property signBits: int
The number of sign bits, s
- property expBias: int
The exponent bias derived from (p,emax)
- This is the bias that should be applied so that
\(floor(log_2(maxFinite)) = emax\)
- property bits: int
The number of bits occupied by the type.
- property eps: float
The difference between 1.0 and the next smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard,
eps = 2**-52, approximately 2.22e-16.
- property epsneg: float
The difference between 1.0 and the next smallest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard,
epsneg = 2**-53, approximately 1.11e-16.
- property iexp: int
The number of bits in the exponent portion of the floating point representation.
- property machep: int
The exponent that yields eps.
- property max: float
The largest representable number.
- property maxexp: int
The smallest positive power of the base (2) that causes overflow.
- property min: float
The smallest representable number, typically
-max.
- property num_nans: int
The number of code points which decode to NaN
- property code_of_nan: int
Return a codepoint for a NaN
- property code_of_posinf: int
Return a codepoint for positive infinity
- property code_of_neginf: int
Return a codepoint for negative infinity
- property code_of_zero: int
Return a codepoint for (non-negative) zero
- property has_zero: bool
Does the format have zero?
This is false if the mantissa is 0 width and we don’t have subnormals - essentially the mantissa is always decoded as 1. If we have subnormals, the only subnormal is zero, and the mantissa is always decoded as 0.
- property code_of_negzero: int
Return a codepoint for negative zero
- property code_of_max: int
Return a codepoint for fi.max
- property code_of_min: int
Return a codepoint for fi.min
- property smallest_normal: float
The smallest positive floating point number with 1 as leading bit in the significand following IEEE-754.
- property smallest_subnormal: float
The smallest positive floating point number with 0 as leading bit in the significand following IEEE-754.
- property smallest: float
The smallest positive floating point number.
- property is_all_subnormal: bool
Are all encoded values subnormal?
- class gfloat.FloatClass[source]
Enum for the classification of a FloatValue.
- NORMAL = 1
A positive or negative normalized non-zero value
- SUBNORMAL = 2
A positive or negative subnormal value
- ZERO = 3
A positive or negative zero value
- INFINITE = 4
A positive or negative infinity (+/-Inf)
- NAN = 5
Not a Number (NaN)
- class gfloat.RoundMode[source]
Enum for IEEE-754 rounding modes.
Result r is obtained from input v depending on rounding mode as follows
- TowardZero = 1
\(\max \{ r ~ s.t. ~ |r| \le |v| \}\)
- TowardNegative = 2
\(\max \{ r ~ s.t. ~ r \le v \}\)
- TowardPositive = 3
\(\min \{ r ~ s.t. ~ r \ge v \}\)
- TiesToEven = 4
Round to nearest, ties to even
- TiesToAway = 5
Round to nearest, ties away from zero
- class gfloat.FloatValue[source]
A floating-point value decoded in great detail.
- ival: int
Integer code point
- fval: float
Value. Assumed to be exactly round-trippable to python float. This is true for all <64bit formats known in 2023.
- exp: int
Raw exponent without bias
- expval: int
Exponent, bias subtracted
- significand: int
Significand as an integer
- fsignificand: float
Significand as a float in the range [0,2)
- signbit: int
1 => negative, 0 => positive
- Type:
Sign bit
- fclass: FloatClass
See FloatClass
- class gfloat.BlockFormatInfo[source]
BlockFormatInfo(name: str, etype: gfloat.types.FormatInfo, k: int, stype: gfloat.types.FormatInfo)
- name: str
Short name for the format, e.g. BlockFP8
- etype: FormatInfo
Element data type
- k: int
Scaling block size
- stype: FormatInfo
Scale datatype
- property element_bits: int
The number of bits in each element, d
- property scale_bits: int
The number of bits in the scale, w
- property block_size_bytes: int
The number of bytes in a block
Pretty printers
- gfloat.float_pow2str(v: float, min_exponent: float = -inf) str[source]
Render floating point values as exact fractions times a power of two.
Example: float_pow2str(127.0) is “127/64*2^6”,
That is (a significand between 1 and 2) times (a power of two).
If min_exponent is supplied, then values with exponent below min_exponent, are printed as fractions less than 1, with exponent set to min_exponent. This is typically used to represent subnormal values.