• Calculate the valid roundings when quantizing to 16-bit floats

    TS/JS's number type is internally an f64, so a value must be quantized when it is converted to f16 for WGSL. WGSL does not mandate a rounding mode, so a number that lies within the f16 range but is not exactly representable in 16 bits has two valid quantizations: the nearest f16 values below and above it. A number that is exactly representable has only one valid quantization. This function calculates the valid roundings and returns them in an array.

    This function does not consider flushing mode, so subnormals are maintained. The caller is responsible for flushing before and after as appropriate.
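    As a rough illustration of what caller-side flushing might look like, the sketch below replaces any value in the f16 subnormal range with zero. The helper name flushSubnormalF16 and the constant F16_MIN_NORMAL are hypothetical and not part of this API, and returning positive zero regardless of sign is an assumption made for brevity.

```typescript
// Smallest positive normal f16 value, 2^-14 (IEEE 754 binary16).
const F16_MIN_NORMAL = 2 ** -14;

// Hypothetical helper: flush values in the f16 subnormal range to zero.
// Assumption: the flushed result is +0 regardless of the input's sign.
function flushSubnormalF16(x: number): number {
  // Non-zero magnitudes below the smallest normal are subnormal in f16.
  return x !== 0 && Math.abs(x) < F16_MIN_NORMAL ? 0 : x;
}
```

    A caller that has to account for flushing would typically consider both the original value and its flushed counterpart, before quantizing and again on the results.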

    Out-of-bounds (OOB) values must be handled according to the overflow rules.

    • If a value is OOB but its magnitude is below 2^(f16.emax + 1), an implementation may round it either to the nearest finite value or to the infinity of the same sign. Values at or beyond 2^(f16.emax + 1) or -(2^(f16.emax + 1)) must be rounded towards the appropriate infinity.
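    The rules described above can be sketched in TypeScript as follows. This is a minimal illustration, not the actual implementation behind this function: the name validF16Roundings and the F16_* constants are hypothetical, the f16 parameters (emax = 15, emin = -14, 10 fraction bits) are those of IEEE 754 binary16, and returning NaN unchanged is an assumption that the description does not cover.

```typescript
// Hypothetical constants describing IEEE 754 binary16 (f16).
const F16_EMAX = 15;          // largest exponent of a finite f16
const F16_EMIN = -14;         // smallest exponent of a normal f16
const F16_FRACTION_BITS = 10; // explicitly stored fraction bits
const F16_MAX = 65504;        // largest finite f16, (2 - 2^-10) * 2^15
const F16_OVERFLOW_LIMIT = 2 ** (F16_EMAX + 1); // 65536

// Hypothetical sketch: all valid f16 roundings of n, in ascending order.
function validF16Roundings(n: number): readonly number[] {
  if (Number.isNaN(n)) return [n]; // assumption: NaN is passed through

  // At or beyond +/-2^(emax + 1), only the matching infinity is valid.
  if (n >= F16_OVERFLOW_LIMIT) return [Infinity];
  if (n <= -F16_OVERFLOW_LIMIT) return [-Infinity];

  // Find the exponent of the binade containing |n|, clamped to
  // [emin, emax]. Clamping at emin keeps subnormals (no flushing).
  const magnitude = Math.abs(n);
  let e = F16_EMIN;
  while (e < F16_EMAX && magnitude >= 2 ** (e + 1)) e++;

  // Spacing between adjacent f16 values in that binade.
  const ulp = 2 ** (e - F16_FRACTION_BITS);

  // Scaling by a power of two is exact in f64, so floor and ceil give
  // the f16 neighbours below and above n.
  let lower = Math.floor(n / ulp) * ulp;
  let upper = Math.ceil(n / ulp) * ulp;

  // A neighbour past the largest finite value stands for the infinity.
  if (lower < -F16_MAX) lower = -Infinity;
  if (upper > F16_MAX) upper = Infinity;

  return lower === upper ? [lower] : [lower, upper];
}
```

    Under these assumptions, validF16Roundings(0.1) yields [0.0999755859375, 0.10003662109375], validF16Roundings(65520) yields [65504, Infinity], and validF16Roundings(65536) yields [Infinity].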


    • n: number

      number to be quantized

    Returns readonly number[]

    all of the acceptable roundings for quantizing to 16 bits, in ascending order.
